From a639a53b0b85f1516c9d09e7eaacc0a44598b461 Mon Sep 17 00:00:00 2001 From: Mikkel Georgsen Date: Wed, 4 Feb 2026 13:50:03 +0000 Subject: [PATCH] docs: add codebase map and domain research Codebase: 7 documents (stack, architecture, structure, conventions, testing, integrations, concerns) Research: 5 documents (stack, features, architecture, pitfalls, summary) --- .planning/codebase/ARCHITECTURE.md | 151 ++++++++++++++ .planning/codebase/CONCERNS.md | 272 ++++++++++++++++++++++++ .planning/codebase/CONVENTIONS.md | 274 ++++++++++++++++++++++++ .planning/codebase/INTEGRATIONS.md | 261 +++++++++++++++++++++++ .planning/codebase/STACK.md | 152 ++++++++++++++ .planning/codebase/STRUCTURE.md | 228 ++++++++++++++++++++ .planning/codebase/TESTING.md | 324 +++++++++++++++++++++++++++++ 7 files changed, 1662 insertions(+) create mode 100644 .planning/codebase/ARCHITECTURE.md create mode 100644 .planning/codebase/CONCERNS.md create mode 100644 .planning/codebase/CONVENTIONS.md create mode 100644 .planning/codebase/INTEGRATIONS.md create mode 100644 .planning/codebase/STACK.md create mode 100644 .planning/codebase/STRUCTURE.md create mode 100644 .planning/codebase/TESTING.md diff --git a/.planning/codebase/ARCHITECTURE.md b/.planning/codebase/ARCHITECTURE.md new file mode 100644 index 0000000..c9ba393 --- /dev/null +++ b/.planning/codebase/ARCHITECTURE.md @@ -0,0 +1,151 @@ +# Architecture + +**Analysis Date:** 2026-02-04 + +## Pattern Overview + +**Overall:** Hub-and-spoke service orchestration with API-driven infrastructure management. 
+ +**Key Characteristics:** +- Centralized management container (VMID 102 - mgmt) coordinating all infrastructure +- Layered abstraction: CLI helpers → REST APIs → external services +- Event-driven notifications (Telegram bot bridges management layer to user) +- Credential-based authentication for all service integrations + +## Layers + +**Management Layer:** +- Purpose: Orchestration and automation entry point for the homelab +- Location: `/home/mikkel/homelab` (git repository in mgmt container) +- Contains: CLI helper scripts (`~/bin/*`), Telegram bot, documentation +- Depends on: Remote SSH access to container/VM IP addresses, Proxmox API, service REST APIs +- Used by: Claude Code automation, Telegram bot commands, cron jobs + +**API Integration Layer:** +- Purpose: Abstracts service APIs into simple CLI interfaces +- Location: `~/bin/` (pve, npm-api, dns, pbs, beszel, kuma, updates, telegram) +- Contains: Python and Bash wrappers around external service APIs +- Depends on: Proxmox API, Nginx Proxy Manager API, Technitium DNS API, PBS REST API, Beszel PocketBase, Uptime Kuma REST API, Telegram Bot API +- Used by: Telegram bot, CI/CD automation, interactive CLI usage + +**Service Layer:** +- Purpose: Individual hosted services providing infrastructure capabilities +- Location: Distributed across containers (NPM, DNS, PBS, Dockge, Forgejo, etc.) 
+- Contains: Docker containers, LXC services, backup systems +- Depends on: PVE host networking, shared storage, external integrations +- Used by: API layer, end-user access via web UI or CLI + +**Data & Communication Layer:** +- Purpose: State persistence and inter-service communication +- Location: Shared storage (`~/stuff` - ZFS bind mount), credential files (`~/.config/*/credentials`) +- Contains: Backup data, configuration files, Telegram inbox/images/files +- Depends on: PVE ZFS dataset, filesystem access +- Used by: All services, backup/restore operations + +## Data Flow + +**Infrastructure Query Flow (e.g., `pve list`):** + +1. User invokes CLI helper: `~/bin/pve list` +2. Helper loads credentials from `~/.config/pve/credentials` +3. Helper authenticates to Proxmox API at `core.georgsen.dk:8006` using token auth +4. Proxmox returns cluster resource state (VMs/containers) +5. Helper formats and displays output to user + +**Service Management Flow (e.g., `dns add myhost 10.5.0.50`):** + +1. User invokes: `~/bin/dns add myhost 10.5.0.50` +2. DNS helper loads credentials and authenticates to Technitium at `10.5.0.2:5380` +3. Helper makes HTTP API call to add A record +4. Technitium stores in zone file and updates DNS records +5. Helper confirms success to user + +**Backup Status Flow (e.g., `/pbs` command in Telegram):** + +1. Telegram user sends `/pbs` command +2. Bot handler in `telegram/bot.py` executes `~/bin/pbs status` +3. PBS helper SSH's to `10.5.0.6` as root +4. SSH command reads backup logs and GC status from PBS container +5. Helper formats human-readable output +6. 
Bot sends result back to Telegram chat (truncated to 4000 chars for Telegram API limit) + +**State Management:** +- Credentials: Stored in `~/.config/*/credentials` files (sourced at runtime) +- Telegram messages: Appended to `telegram/inbox` file for Claude to read +- Media uploads: Saved to `telegram/images/` and `telegram/files/` with timestamps +- Authorization: `telegram/authorized_users` file maintains allowlist of chat IDs + +## Key Abstractions + +**Helper Scripts (API Adapters):** +- Purpose: Translate user intent into remote service API calls +- Examples: `~/bin/pve`, `~/bin/dns`, `~/bin/pbs`, `~/bin/beszel`, `~/bin/kuma` +- Pattern: Load credentials → authenticate → execute command → format output +- Language: Mix of Python (pve, updates, telegram) and Bash (dns, pbs, beszel, kuma) + +**Telegram Bot:** +- Purpose: Provides two-way interactive access to management functions +- Implementation: `telegram/bot.py` using python-telegram-bot library +- Pattern: Command handlers dispatch to helper scripts, results sent back to user +- Channels: Commands (e.g., `/pbs`), free-text messages saved to inbox, photos/files downloaded + +**Service Registry (Documentation):** +- Purpose: Centralized reference for service locations and access patterns +- Implementation: `homelab-documentation.md` and `CLAUDE.md` +- Contents: IP addresses, ports, authentication methods, SSH targets, network topology + +## Entry Points + +**CLI Usage (Direct):** +- Location: `~/bin/{helper}` scripts +- Triggers: Manual invocation by user or cron jobs +- Responsibilities: Execute service operations, format output, validate inputs + +**Telegram Bot:** +- Location: `telegram/bot.py` (systemd service: `telegram-bot.service`) +- Triggers: Telegram message or command from authorized user +- Responsibilities: Authenticate user, route command/message, execute via helper scripts, send response + +**Automation Scripts:** +- Location: Potential cron jobs or scheduled tasks +- Triggers: Time-based 
scheduling +- Responsibilities: Execute periodic management tasks (e.g., backup checks, updates) + +**Manual Execution:** +- Location: Interactive shell in mgmt container +- Triggers: User SSH session +- Responsibilities: Run helpers for ad-hoc infrastructure management + +## Error Handling + +**Strategy:** Graceful degradation with informative messaging. + +**Patterns:** +- CLI helpers return non-zero exit codes on failure (exception handling in Python, `set -e` in Bash) +- Timeout protection: Telegram bot commands have 30-second timeout (configurable per command) +- Service unavailability: Caught in try/except blocks, fall back to next option (e.g., `pve` tries LXC first, then QEMU) +- Credential failures: Load-time validation, clear error message if credentials file missing +- Network errors: SSH timeouts, API connection failures logged to stdout/stderr + +## Cross-Cutting Concerns + +**Logging:** +- Telegram bot uses Python stdlib logging (INFO level, writes to systemd journal) +- CLI helpers write directly to stdout/stderr +- PBS helper uses SSH error output for remote command failures + +**Validation:** +- Telegram bot validates hostnames (alphanumeric + dots + hyphens only) before ping +- DNS helper validates that name and IP are provided before API call +- PVE helper validates VMID is integer before API call + +**Authentication:** +- Credentials stored in `~/.config/{service}/credentials` as simple key=value files +- Sourced at runtime (Bash) or read at startup (Python) +- Token-based auth for Proxmox (no password in memory) +- Basic auth for DNS and other REST APIs (credentials URL-encoded if needed) +- Bearer token for Uptime Kuma (API key-based) + +--- + +*Architecture analysis: 2026-02-04* diff --git a/.planning/codebase/CONCERNS.md b/.planning/codebase/CONCERNS.md new file mode 100644 index 0000000..75681e5 --- /dev/null +++ b/.planning/codebase/CONCERNS.md @@ -0,0 +1,272 @@ +# Codebase Concerns + +**Analysis Date:** 2026-02-04 + +## Tech Debt + +**IP 
Addressing Scheme Inconsistency:** +- Issue: Container IPs don't follow VMID convention. NPM (VMID 100) is at .1, Dockge (VMID 101) at .10, PBS (VMID 106) at .6, instead of matching .100, .101, .106 +- Files: `homelab-documentation.md` (lines 139-159) +- Impact: Manual IP tracking required, DNS records must be maintained separately, new containers require manual IP assignment planning, documentation drift risk +- Fix approach: Execute TODO task to reorganize vmbr1 to VMID=IP scheme (.100-.253 range), update NPM proxy hosts, DNS records (lab.georgsen.dk), and documentation + +**DNS Record Maintenance Manual:** +- Issue: Internal DNS (Technitium) and external DNS (dns.services) require manual updates when IPs/domains change +- Files: `homelab-documentation.md` (lines 432-449), `~/bin/dns` script +- Impact: Risk of records becoming stale after IP migrations, no automation for new containers +- Fix approach: Implement `dns-services` helper script (TODO.md line 27) with API integration for automatic updates + +**Unimplemented Helper Scripts:** +- Issue: `dns-services` API integration promised in TODO but not implemented +- Files: `TODO.md` (line 27), `dns-services/credentials` exists but script doesn't +- Impact: Manual dns.services operations required, cannot automate domain setup +- Fix approach: Create `~/bin/dns-services` wrapper (endpoint documented in TODO) + +**Ping Capability Missing on 14 Containers:** +- Issue: Unprivileged LXC containers drop cap_net_raw, breaking ping on VMIDs 100, 101, 102, 103, 104, 105, 107, 108, 110, 111, 112, 114, 115, 1000 +- Files: `TODO.md` (lines 31-33), `CLAUDE.md` (lines 252-255) +- Impact: Health monitoring fails, network diagnostics broken, Telegram bot status checks incomplete (bot has no ping on home network itself), Uptime Kuma monitors may show false negatives +- Fix approach: Run `setcap cap_net_raw+ep /bin/ping` on each container (must be reapplied after iputils-ping updates) + +**Version Pinning Warnings:** +- Issue:
CLAUDE.md section 227-241 warns about hardcoded versions becoming stale +- Files: `homelab-documentation.md` (lines 217, 228, 239), `~/bin/updates` script shows version checking is implemented but some configs have `latest` tags +- Impact: Security patch delays, incompatibilities when manually deploying services +- Fix approach: Always query GitHub API for latest versions (updates script does this correctly for discovery phase) + +## Known Bugs + +**Telegram Bot Inbox Storage Race Condition:** +- Symptoms: Concurrent message writes could corrupt inbox file, messages may be lost +- Files: `telegram/bot.py` (lines 39, 200-220 message handling), `~/bin/telegram` (lines 73-79 clear command) +- Trigger: Multiple rapid messages from admin or concurrent bot operations +- Workaround: Clear inbox frequently and check for corruption; bot currently appends to file without locking +- Root cause: File-based inbox with no atomic writes or mutex protection + +**PBS Backup Mount Dependency Not Enforced:** +- Symptoms: PBS services may start before Synology CIFS mount is available, backup path unreachable +- Files: `homelab-documentation.md` (lines 372-384), container 106 config +- Trigger: System reboot when Tailscale connectivity is delayed +- Workaround: Manual restart of proxmox-backup-proxy and proxmox-backup services +- Root cause: systemd dependency chain `After=mnt-synology.mount` doesn't guarantee mount is ready at service start time + +**DragonflyDB Password in Plain Text in Documentation:** +- Symptoms: Database password visible in compose file and documentation +- Files: `homelab-documentation.md` (lines 248-250) +- Trigger: Anyone reading docs or inspecting git history +- Workaround: Consider password non-critical if container only accessible on internal network +- Root cause: Password stored in version control and documentation rather than .env or secrets file + +**NPM Proxy Host 18 (mh.datalos.dk) Not Configured:** +- Symptoms: Domain not resolving despite DNS record 
missing and NPM entry (ID 18) mentioned in TODO +- Files: `TODO.md` (line 29), `homelab-documentation.md` (proxy hosts section) +- Trigger: Accessing mh.datalos.dk from browser +- Workaround: Must be configured manually via NPM web UI +- Root cause: Setup referenced in TODO but not completed + +## Security Considerations + +**Exposed Credentials in Git History:** +- Risk: Credential files committed (credentials, SSH keys, telegram token examples) +- Files: All credential files in `telegram/`, `pve/`, `forgejo/`, `dns/`, `dockge/`, `uptime-kuma/`, `beszel/`, `dns-services/` directories (8+ files) +- Current mitigation: Files are .gitignored in main repo but present in working directory +- Recommendations: Rotate all credentials listed, audit git log for historical commits, use HashiCorp Vault or pass for credential storage, document secret rotation procedure + +**Public IP Hardcoded in Documentation:** +- Risk: Home IP 83.89.248.247 exposed in multiple locations +- Files: `homelab-documentation.md` (lines 98, 102), `CLAUDE.md` (line 256) +- Current mitigation: IP is already public/static, used for whitelist access +- Recommendations: Document that whitelisting this IP is intentional, no other PII mixed in + +**Telegram Bot Authorization Model Too Permissive:** +- Risk: First user to message bot becomes admin automatically with no verification +- Files: `telegram/bot.py` (lines 86-95) +- Current mitigation: Bot only responds to authorized user, requires bot discovery +- Recommendations: Require multi-factor authorization on first start (e.g., PIN from environment variable), implement audit logging of all bot commands + +**Database Credentials in Environment Variables:** +- Risk: DragonflyDB password passed via Docker command line (visible in `docker ps`, logs, process listings) +- Files: `homelab-documentation.md` (line 248) +- Current mitigation: Container only accessible on internal vmbr1 network +- Recommendations: Use Docker secrets or mounted .env files instead 
of command-line arguments + +**Synology CIFS Credentials in fstab:** +- Risk: SMB credentials stored in plaintext in fstab file with mode 0644 (world-readable) +- Files: `homelab-documentation.md` (line 369) +- Current mitigation: Mounted on container-only network, requires PBS container access +- Recommendations: Use credentials file with mode 0600, rotate credentials regularly, monitor file permissions + +**SSH Keys Included in Documentation:** +- Risk: Public SSH keys hardcoded in CLAUDE.md setup examples +- Files: `CLAUDE.md` and `homelab-documentation.md` SSH key examples +- Current mitigation: Public keys only (not private), used for container access +- Recommendations: Rotate these keys if documentation is ever exposed, don't include in public repos + +## Performance Bottlenecks + +**Single NVMe Storage (RAID0) Without Local Redundancy:** +- Problem: Core server has 2x1TB NVMe in RAID0 (striped, no redundancy) +- Files: `homelab-documentation.md` (lines 17-24) +- Cause: Cost optimization for Hetzner dedicated server +- Impact: Single drive failure = total data loss; database corruption risk from RAID0 stripe inconsistency +- Improvement path: (1) Ensure PBS backups run successfully to Synology, (2) Test backup restore procedure monthly, (3) Plan upgrade path if budget allows (3-way mirror or RAID1) + +**Backup Dependency on Single Tailscale Gateway:** +- Problem: All PBS backups to Synology go through Tailscale relay (10.5.0.134), single point of failure +- Files: `homelab-documentation.md` (lines 317-427) +- Cause: Synology only accessible via Tailscale network, relay container required +- Impact: Tailscale relay downtime = backup failure; no local backup option +- Improvement path: (1) Add second Tailscale relay for redundancy, (2) Explore PBS direct SSH backup mode, (3) Monitor relay container health + +**DNS Queries All Route Through Single Technitium Container:** +- Problem: All internal DNS (lab.georgsen.dk) goes through container 115, DHCP defaults to 
this server +- Files: `homelab-documentation.md` (lines 309-315), container config +- Cause: Single container architecture +- Impact: DNS outage = network unreachable (containers can't resolve any hostnames) +- Improvement path: (1) Deploy DNS replica on another container, (2) Configure DHCP to use multiple DNS servers, (3) Set upstream DNS fallback + +**Script Execution via Telegram Bot with Subprocess Timeout:** +- Problem: Bot runs helper scripts with 30-second timeout, commands like PBS backup query can exceed limit +- Files: `telegram/bot.py` (lines 60-78, 191) +- Cause: Helper scripts do remote SSH execution, network latency variable +- Impact: Commands truncated mid-execution, incomplete status reports, timeouts on slow networks +- Improvement path: Increase timeout selectively, implement command queuing, cache results for frequently-called commands + +## Fragile Areas + +**Installer Shell Script with Unimplemented Sections:** +- Files: `pve-homelab-kit/install.sh` (495+ lines with TODO comments) +- Why fragile: Multiple TODO placeholders indicate incomplete implementation; wizard UI done but ~30 implementation TODOs remain +- Safe modification: (1) Don't merge branches without running through full install, (2) Test each section independently, (3) Add shell `set -e` error handling +- Test coverage: Script has no tests, no dry-run mode, no rollback capability + +**Container Configuration Manual in LXC Config Files:** +- Files: `/etc/pve/lxc/*.conf` across Proxmox host (not in repo, not version controlled) +- Why fragile: Critical settings (features, ulimits, AppArmor) outside version control, drift risk after manual fixes +- Safe modification: Keep backup copies in `homelab-documentation.md` (already done for PBS), automate via Terraform/Ansible if future containers added +- Test coverage: Config changes only tested on live container (no staging env) + +**Helper Scripts with Hardcoded IPs and Paths:** +- Files: `~/bin/updates` (lines 16-17, 130), `~/bin/pbs`, 
`~/bin/pve`, `~/bin/dns` +- Why fragile: DOCKGE_HOST, PVE_HOST hardcoded; if IPs change during migration, all scripts must be updated manually +- Safe modification: Extract to config file (e.g., `/etc/homelab/config.sh` or environment variables) +- Test coverage: Scripts tested against live infrastructure only + +**SSH-Based Container Access Without Key Verification:** +- Files: `~/bin/updates` (lines 115-131), scripts use the `-q` flag, which suppresses most warning and diagnostic messages +- Why fragile: `ssh -q` does not itself disable StrictHostKeyChecking, but it can hide warning output, so host-key problems (a possible MITM) are easy to miss; scripts assume SSH keys are pre-installed +- Safe modification: Add `-o StrictHostKeyChecking=accept-new` to record host keys on first connection and reject changed keys afterwards, document key distribution procedure +- Test coverage: SSH connectivity assumed working + +**Backup Monitoring Without Alerting on Failure:** +- Files: `~/bin/pbs`, `telegram/bot.py` (status command only, no automatic failure alerts) +- Why fragile: Failed backups only visible if manually checked; no monitoring of backup completion +- Safe modification: Add systemd timer to check PBS status hourly, send Telegram alert on failure +- Test coverage: Manual checks only + +## Scaling Limits + +**Container IP Space Exhaustion:** +- Current capacity: vmbr1 is /24 (256 IPs, .0-.255), DHCP range .100-.200 (101 IPs available for DHCP), static IPs scattered +- Limit: After ~150 containers, IP fragmentation becomes difficult to manage; DHCP range conflicts with static allocation +- Scaling path: (1) Implement TODO IP scheme (VMID=IP), (2) Expand to /23 (512 IPs) if more containers needed, (3) Use vmbr2 (vSwitch) for secondary network + +**Backup Datastore Single Synology Volume:** +- Current capacity: Synology `pbs-backup` share unknown size (not documented) +- Limit: Unknown when share becomes full; no warning system implemented +- Scaling path: (1) Document share capacity in homelab-documentation.md, (2) Add usage monitoring to `beszel` or Uptime Kuma, (3) Plan expansion to second NAS + +**Dockge Stack
Limit:** +- Current capacity: Dockge container 101 running ~8-10 stacks visible in documentation +- Limit: No documented resource constraints; may hit CPU/RAM limits on Hetzner AX52 with more containers +- Scaling path: (1) Monitor Dockge resource usage via Beszel, (2) Profile Dragonfly memory usage, (3) Plan VM migration for heavy workloads + +**DNS Query Throughput:** +- Current capacity: Single Technitium container handling all internal DNS +- Limit: Container CPU/RAM limits unknown; no QPS monitoring +- Scaling path: (1) Add DNS replica, (2) Monitor query latency, (3) Profile Technitium logs for slow queries + +## Dependencies at Risk + +**Technitium DNS (Unmaintained Risk):** +- Risk: TechnitiumSoftware/DnsServer has irregular commit history; last significant release early 2024 +- Impact: Security fixes may be delayed; compatibility with newer Linux kernels unknown +- Migration plan: (1) Profile current Technitium features used, (2) Evaluate CoreDNS or Dnsmasq alternatives, (3) Plan gradual migration with dual DNS + +**DragonflyDB as Redis Replacement:** +- Risk: Dragonfly has a smaller ecosystem than Redis; breaking changes possible in minor updates +- Impact: Applications expecting Redis behavior may fail; less community support for issues +- Migration plan: (1) Pin Dragonfly version in compose file (currently `latest`), (2) Test upgrades in dev environment, (3) Document any API incompatibilities found + +**Dockge (Single Maintainer Project):** +- Risk: Dockge maintained by one developer (louislam); bus factor of one +- Impact: If maintainer loses interest, fixes and features stop; dependency on their release schedule +- Migration plan: (1) Use Dockge for UI only, don't depend on it for production orchestration, (2) Keep docker-compose expertise on team, (3) Consider Portainer as fallback alternative + +**Forgejo (Younger than Gitea):** +- Risk: Forgejo is a recent fork of Gitea; database schema changes possible in patch versions +- Impact: Upgrades may require manual
migrations; data loss risk if migration fails +- Migration plan: (1) Test Forgejo upgrades on backup copy first, (2) Document upgrade procedure, (3) Keep Gitea as fallback if Forgejo breaks + +## Missing Critical Features + +**No Automated Health Monitoring/Alerting:** +- Problem: Status checks exist (via Telegram bot, Uptime Kuma) but no automatic alerts when services fail +- Blocks: Cannot sleep soundly; must manually check status to detect outages +- Implementation path: (1) Add Uptime Kuma HTTP monitors for all public services, (2) Create Telegram alert webhook, (3) Monitor PBS backup success daily + +**No Automated Certificate Renewal Verification:** +- Problem: NPM handles Let's Encrypt renewal, but no monitoring for renewal failures +- Blocks: Certificates could expire silently; discovered during service failures +- Implementation path: (1) Add Uptime Kuma alert for HTTP 200 on https://* services, (2) Add monthly certificate expiry check, (3) Set up renewal failure alerts + +**No Disaster Recovery Runbook:** +- Problem: Procedures for rescuing locked-out server (Hetzner Rescue Mode) not documented +- Blocks: If SSH access lost, cannot recover without external procedures +- Implementation path: (1) Document Hetzner Rescue Mode recovery steps, (2) Create network reconfiguration backup procedures, (3) Test rescue mode monthly + +**No Change Log / Audit Trail:** +- Problem: Infrastructure changes not logged; drift from documentation occurs silently +- Blocks: Unknown who made changes, when, and why; cannot track config evolution +- Implementation path: (1) Add git commit requirement for all manual changes, (2) Create change notification to Telegram, (3) Weekly drift detection report + +**No Secrets Management System:** +- Problem: Credentials scattered across plaintext files, git history, and documentation +- Blocks: Cannot safely share access with team members; no credential rotation capability +- Implementation path: (1) Deploy HashiCorp Vault or Vaultwarden, 
(2) Migrate all secrets to vault, (3) Create credential rotation procedures + +## Test Coverage Gaps + +**PBS Backup Restore Not Tested:** +- What's not tested: Full restore procedures; assumed to work but never verified +- Files: `homelab-documentation.md` (lines 325-392), no restore test documented +- Risk: If restore needed, may discover issues during actual data loss emergency +- Priority: HIGH - Add monthly restore test procedure (restore single VM to temporary location, verify data integrity) + +**Network Failover Scenarios:** +- What's not tested: What happens if Tailscale relay (1000) goes down, if NPM container restarts, if DNS returns SERVFAIL +- Files: No documented failure scenarios +- Risk: Unknown recovery time; applications may hang instead of failing gracefully +- Priority: HIGH - Document and test each service's failure mode + +**Helper Script Error Handling:** +- What's not tested: Scripts with SSH timeouts, host unreachable, malformed responses +- Files: `~/bin/updates`, `~/bin/pbs`, `~/bin/pve` (error handling exists but not tested against failures) +- Risk: Silent failures could go unnoticed; incomplete output returned to caller +- Priority: MEDIUM - Add error injection tests (mock SSH failures) + +**Telegram Bot Commands Under Load:** +- What's not tested: Bot response when running concurrent commands, or when helper scripts timeout +- Files: `telegram/bot.py` (no load tests, concurrency behavior unknown) +- Risk: Bot may hang or lose messages under heavy load +- Priority: MEDIUM - Add load test with 10+ concurrent commands + +**Container Migration (VMID IP Scheme Change):** +- What's not tested: Migration of 15+ containers to new IP scheme; full rollback procedures +- Files: `TODO.md` (line 5-15, planned but not executed) +- Risk: Single IP misconfiguration could take multiple services offline +- Priority: HIGH - Create detailed migration runbook with rollback at each step before executing + +--- + +*Concerns audit: 2026-02-04* diff --git 
a/.planning/codebase/CONVENTIONS.md b/.planning/codebase/CONVENTIONS.md new file mode 100644 index 0000000..420199f --- /dev/null +++ b/.planning/codebase/CONVENTIONS.md @@ -0,0 +1,274 @@ +# Coding Conventions + +**Analysis Date:** 2026-02-04 + +## Naming Patterns + +**Files:** +- Python files: lowercase with underscores (e.g., `bot.py`, `credentials`) +- Bash scripts: lowercase with hyphens (e.g., `npm-api`, `uptime-kuma`) +- Helper scripts in `~/bin/`: all lowercase, no extension (e.g., `pve`, `pbs`, `dns`) + +**Functions:** +- Python: snake_case (e.g., `cmd_status()`, `get_authorized_users()`, `run_command()`) +- Bash: snake_case with `cmd_` prefix for command handlers (e.g., `cmd_status()`, `cmd_tasks()`) +- Bash: auxiliary functions also use snake_case (e.g., `ssh_pbs()`, `get_token()`) + +**Variables:** +- Python: snake_case for local/module vars (e.g., `authorized_users`, `output_lines`) +- Python: UPPERCASE for constants (e.g., `TOKEN`, `INBOX_FILE`, `AUTHORIZED_FILE`, `NODE`, `PBS_HOST`) +- Bash: UPPERCASE for environment variables and constants (e.g., `PBS_HOST`, `TOKEN`, `BASE`, `DEFAULT_ZONE`) +- Bash: lowercase for local variables (e.g., `hours`, `cutoff`, `status_icon`) + +**Types/Classes:** +- Python: PascalCase for imported classes (e.g., `ProxmoxAPI`, `Update`, `Application`) +- Dictionary/config keys: lowercase with hyphens or underscores (e.g., `token_name`, `max-mem`) + +## Code Style + +**Formatting:** +- No automated formatter detected in codebase +- Python: PEP 8 conventions followed informally + - 4-space indentation + - Max line length ~90-100 characters (observed in practice) + - Blank lines: 2 lines before module-level functions, 1 line before methods +- Bash: 4-space indentation (observed) + +**Linting:** +- No linting configuration detected (no .pylintrc, .flake8, .eslintrc) +- Code style is manually maintained + +**Docstrings:** +- Python: Triple-quoted strings at module level describing purpose + - Example from `telegram/bot.py`: + 
```python + """ + Homelab Telegram Bot + Two-way interactive bot for homelab management and notifications. + """ + ``` +- Python: Function docstrings used for major functions + - Single-line format for simple functions + - Example: `"""Handle /start command - first contact with bot."""` + - Example: `"""Load authorized user IDs."""` + +## Import Organization + +**Order:** +1. Standard library imports (e.g., `sys`, `os`, `json`, `subprocess`) +2. Third-party imports (e.g., `ProxmoxAPI`, `telegram`, `pocketbase`) +3. Local imports (rarely used in this codebase) + +**Path Aliases:** +- No aliases detected +- Absolute imports used throughout + +**Credential Loading Pattern:** +All scripts that need credentials follow the same pattern: +```python +# Load credentials (`service` is a placeholder for the service name, e.g. "dns") +creds_path = Path.home() / ".config" / service / "credentials" +creds = {} +with open(creds_path) as f: + for line in f: + if '=' in line: + key, value = line.strip().split('=', 1) + creds[key] = value +``` + +Or in Bash: +```bash +source ~/.config/dns/credentials +``` + +## Error Handling + +**Patterns:** +- Python: Try-except with broad exception catching (bare `except:` used in `pve` script lines 70, 82, 95, 101) + - Not ideal but pragmatic for CLI tools that need to try multiple approaches + - Example from `pve`: + ```python + try: + status = pve.nodes(NODE).lxc(vmid).status.current.get() + # ... + return + except: + pass + ``` + +- Python: Explicit exception handling in telegram bot + - Catches `subprocess.TimeoutExpired` specifically in `run_command()` function + - Example from `telegram/bot.py`: + ```python + try: + result = subprocess.run(...) + output = result.stdout or result.stderr or "No output" + if len(output) > 4000: + output = output[:4000] + "\n... 
(truncated)" + return output + except subprocess.TimeoutExpired: + return "Command timed out" + except Exception as e: + return f"Error: {e}" + ``` + +- Bash: Set strict mode with `set -e` in some scripts (`dns` script line 12) + - Causes script to exit on first error + +- Bash: No error handling in most scripts (`pbs`, `beszel`, `kuma`) + - Relies on exit codes implicitly + +**Return Value Handling:** +- Python: Functions return data directly or None on failure + - Example from `pbs` helper: Returns JSON-parsed data or string output + - Example from `pve`: Returns nothing (prints output), but uses exceptions for flow control + +- Python: Command runner returns error strings: `"Command timed out"`, `"Error: {e}"` + +## Logging + +**Framework:** +- Python: Standard `logging` module + - Configured in `telegram/bot.py` lines 18-22: + ```python + logging.basicConfig( + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', + level=logging.INFO + ) + logger = logging.getLogger(__name__) + ``` + - Log level: INFO + - Format includes timestamp, logger name, level, message + +**Patterns:** +- `logger.info()` for general informational messages + - Example: `logger.info("Starting Homelab Bot...")` + - Example: `logger.info(f"Inbox message from {user.first_name}: {message[:50]}...")` + - Example: `logger.info(f"Photo saved from {user.first_name}: {filepath}")` + +- Bash: Uses `echo` for output, no structured logging + - Informational messages for user feedback + - Error messages sent to stdout (not stderr) + +## Comments + +**When to Comment:** +- Module-level docstrings at top of file (required for all scripts) +- Usage examples in module docstrings (e.g., `pve`, `pbs`, `kuma`) +- Inline comments for complex logic (e.g., in `pbs` script parsing hex timestamps) +- Comments on tricky regex patterns (e.g., `pbs` tasks parsing) + +**Bash Comments:** +- Header comment with script name, purpose, and usage (lines 1-10) +- Inline comments before major sections (e.g., `# 
Datastore info`, `# Storage stats`) +- No comments in simple expressions + +**Python Comments:** +- Header comment with purpose (module docstring) +- Sparse inline comments except for complex sections +- Example from `telegram/bot.py` line 71: `# Telegram has 4096 char limit per message` +- Example from `pve` line 70: `# Try as container first` + +## Function Design + +**Size:** +- Python: Functions are generally 10-50 lines + - Smaller functions for simple operations (e.g., `is_authorized()` is 2 lines) + - Larger functions for command handlers that do setup + API calls (e.g., `status()` is 40 lines) + +- Bash: Functions are typically 20-80 lines + - Longer functions acceptable for self-contained operations like `cmd_status()` in `pbs` + +**Parameters:** +- Python: Explicit parameters, typically 1-5 parameters per function + - Optional parameters with defaults (e.g., `timeout: int = 30`, `port=45876`) + - Type hints not used consistently (some functions have them, many don't) + +- Bash: Parameters passed as positional arguments + - Some functions take zero parameters and rely on global variables + - Example: `ssh_pbs()` in `pbs` uses global `$PBS_HOST` + +**Return Values:** +- Python: Functions return data (strings, dicts, lists) or None + - Command handlers often return nothing (implicitly None) + - Helper functions return computed values (e.g., `is_authorized()` returns bool) + +- Bash: Functions print output directly, return exit codes + - No explicit return values beyond exit codes + - Output captured by caller with `$()` + +## Module Design + +**Exports:** +- Python: All functions are module-level, no explicit exports + - `if __name__ == "__main__":` pattern used in all scripts to guard main execution + - Example from `beszel` lines 101-152 + +- Bash: All functions are script-level, called via case statement + - Main dispatch logic at bottom of script + - Example from `dns` lines 29-106: `case "$1" in ... 
esac` + +**Async/Await (Telegram Bot Only):** +- Python telegram bot uses `asyncio` and `async def` for all handlers +- All command handlers are async (e.g., `async def start()`) +- Use `await` for async operations (e.g., `await update.message.reply_text()`) +- Example from `telegram/bot.py` lines 81-94: +```python +async def start(update: Update, context: ContextTypes.DEFAULT_TYPE): + """Handle /start command - first contact with bot.""" + user = update.effective_user + chat_id = update.effective_chat.id + # ... async operations with await +``` + +**File Structure:** +- Single-file modules: Most helpers are single files +- `telegram/bot.py`: Main bot implementation with all handlers +- `/bin/` scripts: Each script is self-contained with helper functions + main dispatch + +## Data Structures + +**JSON/Config Files:** +- Credentials files: Simple `KEY=value` format (no JSON) +- PBS task logging: Uses hex-encoded UPID format, parsed with regex +- Telegram bot: Saves messages to text files with timestamp prefix +- JSON output: Parsed with `python3 -c "import sys, json; ..."` in Bash scripts + +**Error Response Patterns:** +- API calls: Check for `.get('status') == 'ok'` or similar +- Command execution: Check `returncode == 0`, capture stdout/stderr +- API clients: Let exceptions bubble up, caught at command handler level + +## Conditionals and Flow Control + +**Python:** +- if/elif/else chains for command dispatch +- Simple truthiness checks: `if not user_id:`, `if not alerts:` +- Example from `telegram/bot.py` line 86-100: Authorization check pattern + +**Bash:** +- case/esac for command dispatch (preferred) +- if [[ ]] with regex matching for parsing +- Example from `pbs` lines 122-143: Complex regex with BASH_REMATCH array + +## Security Patterns + +**Credential Management:** +- Credentials stored in `~/.config//credentials` with restricted permissions (not enforced in code) +- Telegram token loaded from file, not environment +- Credentials never logged or printed 
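The credential-management notes above say restricted permissions are expected but not enforced in code. A minimal sketch of how a loader could enforce them is shown here, assuming POSIX permission bits. `load_credentials()` is a hypothetical helper, not part of the existing scripts, and skipping `#` comment lines is an added assumption.

```python
import stat
from pathlib import Path


def load_credentials(path: Path) -> dict:
    """Parse a KEY=value credentials file, refusing group/world-accessible files.

    Hypothetical helper for illustration; not part of the existing scripts.
    """
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        # Any group or other permission bit set means the file is too open
        raise PermissionError(f"{path} has mode {oct(mode)}; expected 0o600 or stricter")

    creds = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Assumption: blank lines and '#' comments are ignored
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)  # split once so values may contain '='
            creds[key] = value
    return creds
```

A script could call this in place of its inline parsing loop and exit with a readable message when `PermissionError` is raised.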
+ +**Input Validation:** +- Python: Basic validation with `isalnum()` check in `ping_host()` function + - Example: `if not host.replace('.', '').replace('-', '').isalnum():` +- Bash: Whitelist command names from case statements +- No SQL injection risk (no databases used directly) + +**Shell Injection:** +- Bash scripts use quoted variables appropriately +- Some inline Python in Bash uses string interpolation (potential risk) + - Example from `dns` lines 31-37: `curl ... | python3 -c "..."` with variable interpolation + +--- + +*Convention analysis: 2026-02-04* diff --git a/.planning/codebase/INTEGRATIONS.md b/.planning/codebase/INTEGRATIONS.md new file mode 100644 index 0000000..4450bde --- /dev/null +++ b/.planning/codebase/INTEGRATIONS.md @@ -0,0 +1,261 @@ +# External Integrations + +**Analysis Date:** 2026-02-04 + +## APIs & External Services + +**Hypervisor Management:** +- **Proxmox VE (PVE)** - Cluster/node management + - SDK/Client: `proxmoxer` v2.2.0 (Python) + - Auth: Token-based (`root@pam!mgmt` token) + - Config: `~/.config/pve/credentials` + - Helper: `~/bin/pve` (list, status, start, stop, create-ct) + - Endpoint: https://65.108.14.165:8006 (local host core.georgsen.dk) + +**Backup Management:** +- **Proxmox Backup Server (PBS)** - Centralized backup infrastructure + - API: REST over HTTPS at 10.5.0.6:8007 + - Auth: Token-based (`root@pam!pve` token) + - Helper: `~/bin/pbs` (status, backups, tasks, errors, gc, snapshots, storage) + - Targets: core.georgsen.dk, pve01.warradejendomme.dk, pve02.warradejendomme.dk namespaces + - Datastore: Synology NAS via CIFS at 100.105.26.130 (Tailscale) + +**DNS Management:** +- **Technitium DNS** - Internal DNS with API + - API: REST at http://10.5.0.2:5380/api/ + - Auth: Username/password based + - Config: `~/.config/dns/credentials` + - Helper: `~/bin/dns` (list, records, add, delete, lookup) + - Internal zone: `lab.georgsen.dk` + - Upstream: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9) + +**Monitoring
APIs:** +- **Uptime Kuma** - Status page & endpoint monitoring + - API: HTTP at 10.5.0.10:3001 + - SDK/Client: `uptime-kuma-api` v1.2.1 (Python) + - Auth: Username/password login + - Config: `~/.config/uptime-kuma/credentials` + - Helper: `~/bin/kuma` (list, info, add-http, add-port, add-ping, delete, pause, resume) + - URL: https://status.georgsen.dk + +- **Beszel** - Server metrics dashboard + - Backend: PocketBase REST API at 10.5.0.10:8090 + - SDK/Client: `pocketbase` v0.15.0 (Python) + - Auth: Admin email/password + - Config: `~/.config/beszel/credentials` + - Helper: `~/bin/beszel` (list, status, add, delete, alerts) + - URL: https://dashboard.georgsen.dk + - Agents: core (10.5.0.254), PBS (10.5.0.6), Dockge (10.5.0.10 + Docker stats) + - Data retention: 30 days (automatic) + +**Reverse Proxy & SSL:** +- **Nginx Proxy Manager (NPM)** - Reverse proxy with SSL + - API: JSON-RPC style (internal Docker API) + - Helper: `~/bin/npm-api` (--host-list, --host-create, --host-delete, --cert-list) + - Config: `~/.config/npm/npm-api.conf` (custom API wrapper) + - UI: http://10.5.0.1:81 (admin panel) + - SSL Provider: Let's Encrypt (HTTP-01 challenge) + - Access Control: NPM Access Lists (ID 1: "home_only" whitelist 83.89.248.247) + +**Git/Version Control:** +- **Forgejo** - Self-hosted Git server + - API: REST at 10.5.0.14:3000/api/v1/ + - Auth: API token based + - Config: `~/.config/forgejo/credentials` + - URL: https://git.georgsen.dk + - Repo: `git@10.5.0.14:mikkel/homelab.git` + - Version: v10.0.1 + +**Data Stores:** +- **DragonflyDB** - Redis-compatible in-memory store + - Host: 10.5.0.10 (Docker in Dockge) + - Port: 6379 + - Protocol: Redis protocol + - Auth: Password protected (`nUq/IfoIQJf/kouckKHRQOk7vV0NwCuI`) + - Client: redis-cli or any Redis library + - Usage: Session/cache storage + +- **PostgreSQL** - Relational database + - Host: 10.5.0.109 (VMID 103) + - Default port: 5432 + - Managed by: Community (Proxmox LXC community images) + - Usage: Sentry system 
and other applications + +## Data Storage + +**Databases:** +- **PostgreSQL 13+** (VMID 103) + - Connection: `postgresql://user@10.5.0.109:5432/dbname` + - Client: psql (CLI) or any PostgreSQL driver + - Usage: Sentry defense intelligence system, application databases + +- **DragonflyDB** (Redis-compatible) + - Connection: `redis://10.5.0.10:6379` (with auth) + - Client: redis-cli or Python redis library + - Backup: Enabled in Docker config, persists to `./data/` + +- **Redis** (VMID 104, deprecated in favor of DragonflyDB) + - Host: 10.5.0.111 + - Status: Still active but DragonflyDB preferred + +**File Storage:** +- **Local Filesystem:** Each container has ZFS subvolume storage at / +- **Shared Storage (ZFS):** `/shared/mikkel/stuff` bind-mounted into containers + - PVE: `rpool/shared/mikkel` dataset + - mgmt (102): `~/stuff` with backup=1 (included in PBS backups) + - dev (111): `~/stuff` (shared access) + - general (113): `~/stuff` (shared access) + - SMB Access: `\\mgmt\stuff` via Tailscale MagicDNS + +**Backup Target:** +- **Synology NAS** (home network) + - Tailscale IP: 100.105.26.130 + - Mount: `/mnt/synology` on PBS + - Protocol: CIFS/SMB 3.0 + - Share: `/volume1/pbs-backup` + - UID mapping: Mapped to admin (squash: map all) + +## Authentication & Identity + +**Auth Providers:** +- **Proxmox PAM** - System-based authentication for PVE/PBS + - Users: root@pam, other system users + - Token auth: `root@pam!mgmt` (PVE), `root@pam!pve` (PBS) + +**SSH Key Authentication:** +- **Ed25519 keys** for user access + - Key: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIOQrK06zVkfY6C1ec69kEZYjf8tC98icCcBju4V751i mikkel@georgsen.dk` + - Deployed to all containers at `~/.ssh/authorized_keys` and `/root/.ssh/authorized_keys` + +**Telegram Bot Authentication:** +- **Telegram Bot Token** - Stored in `~/telegram/credentials` +- **Authorized Users:** Whitelist stored in `~/telegram/authorized_users` (chat IDs) +- **First user:** Auto-authorized on first `/start` command +- **Two-way 
messaging:** Text/photos/files saved to `~/telegram/inbox` + +## Monitoring & Observability + +**Error Tracking:** +- **Sentry** (custom defense intelligence system, VMID 105) + - Purpose: Monitor military contracting opportunities + - Databases: PostgreSQL (103) + Redis (104) + - Not a traditional error tracker - custom business intelligence system + +**Metrics & Monitoring:** +- **Beszel**: Server CPU, RAM, disk usage metrics +- **Uptime Kuma**: HTTP, TCP port, ICMP ping monitoring +- **PBS**: Backup task logs, storage metrics, dedup stats + +**Logs:** +- **PBS logs:** SSH queries via `~/bin/pbs`, stored on PBS container +- **Forgejo logs:** `/var/lib/forgejo/log/forgejo.log` (for fail2ban) +- **Telegram bot logs:** stdout to systemd service `telegram-bot.service` +- **Helper scripts:** Output to stdout, can be piped/redirected + +## CI/CD & Deployment + +**Hosting:** +- **Hetzner** (public cloud) - Primary: core.georgsen.dk (AX52) +- **Home Infrastructure** - Synology NAS for backups, future NUC cluster +- **Docker/Dockge** - Application deployment via Docker Compose (10.5.0.10) + +**CI Pipeline:** +- **None detected** - Manual deployment via Dockge or container management +- **Version control:** Forgejo (self-hosted Git server) +- **Update checks:** `~/bin/updates` script checks for updates across services + - Tracked: dragonfly, beszel, uptime-kuma, snappymail, dockge, npm, forgejo, dns, pbs + +**Deployment Tools:** +- **Dockge** - Docker Compose UI for stack management +- **PVE API** - Proxmox VE for container/VM provisioning +- **Helper scripts** - `~/bin/pve create-ct` for automated container creation + +## Environment Configuration + +**Required Environment Variables (in credential files):** + +DNS (`~/.config/dns/credentials`): +``` +DNS_HOST=10.5.0.2 +DNS_PORT=5380 +DNS_USER=admin +DNS_PASS= +``` + +Proxmox (`~/.config/pve/credentials`): +``` +host=65.108.14.165:8006 +user=root@pam +token_name=mgmt +token_value= +``` + +Uptime Kuma 
(`~/.config/uptime-kuma/credentials`): +``` +KUMA_HOST=10.5.0.10 +KUMA_PORT=3001 +KUMA_USER=admin +KUMA_PASS= +``` + +Beszel (`~/.config/beszel/credentials`): +``` +BESZEL_HOST=10.5.0.10 +BESZEL_PORT=8090 +BESZEL_USER=admin@example.com +BESZEL_PASS= +``` + +Telegram (`~/telegram/credentials`): +``` +TELEGRAM_BOT_TOKEN= +``` + +## Webhooks & Callbacks + +**Incoming Webhooks:** +- **Uptime Kuma** - No webhook ingestion detected +- **PBS** - Backup completion tasks (internal scheduling, no external webhooks) +- **Forgejo** - No webhook configuration documented + +**Outgoing Notifications:** +- **Telegram Bot** - Two-way messaging for homelab status + - Commands: /status, /pbs, /backups, /beszel, /kuma, /ping + - File uploads: Photos saved to `~/telegram/images/`, documents to `~/telegram/files/` + - Text inbox: Messages saved to `~/telegram/inbox` for Claude review + +**Event-Driven:** +- **PBS Scheduling** - Daily backup tasks at 01:00, 01:30, 02:00 (core, pve01, pve02) +- **Prune/GC** - Scheduled at 21:00 (prune) and 22:30 (garbage collection) + +## VPN & Remote Access + +**Tailscale Network:** +- **Primary relay:** 10.5.0.134 + 10.9.1.10 (VMID 1000, exit node capable) +- **Tailscale IPs:** + - PBS: 100.115.85.120 + - Synology NAS: 100.105.26.130 + - dev: 100.85.227.17 + - sentry: 100.83.236.113 + - Friends' nodes: pve01 (100.99.118.54), pve02 (100.82.87.108) + - Other devices: mge-t14, mikflix, xanderryzen, nvr01, tailscalemg + +**SSH Access Pattern:** +- All containers/VMs accessible via SSH from mgmt (102) +- SSH keys pre-deployed to all systems +- Tailscale used for accessing from external networks + +## External DNS + +**DNS Provider:** dns.services (Danish free DNS with API) +- Domains managed: + - georgsen.dk + - dataloes.dk + - microsux.dk + - warradejendomme.dk +- Used for external domain registration only +- Internal zone lookups go to Technitium (10.5.0.2) + +--- + +*Integration audit: 2026-02-04* diff --git a/.planning/codebase/STACK.md 
b/.planning/codebase/STACK.md new file mode 100644 index 0000000..f663cc0 --- /dev/null +++ b/.planning/codebase/STACK.md @@ -0,0 +1,152 @@ +# Technology Stack + +**Analysis Date:** 2026-02-04 + +## Languages + +**Primary:** +- **Bash** - Infrastructure automation, API wrappers, system integration + - Helper scripts at `~/bin/` for service APIs + - Installation and setup in `pve-homelab-kit/install.sh` + +- **Python 3.12.3** - Management tools, monitoring, bot automation + - Virtual environment: `~/venv/` (activated with `source ~/venv/bin/activate`) + - Primary usage: API clients, Telegram bot, helper scripts + +## Runtime + +**Environment:** +- **Python 3.12.3** (system) +- **Bash 5+** (system shell) + +**Package Manager:** +- **pip** v24.0 (Python package manager) +- Lockfile: Virtual environment at `~/venv/` (not traditional pip.lock) + +## Frameworks + +**Core Infrastructure:** +- **Proxmox VE** (v8.x) - Hypervisor/container platform on core.georgsen.dk +- **Proxmox Backup Server (PBS)** v2.x - Backup infrastructure (10.5.0.6:8007) +- **LXC Containers** - Primary virtualization method +- **KVM VMs** - Full VMs when needed (mail server VM 200) +- **Docker/Docker Compose** - Application deployment via Dockge (10.5.0.10) + +**Application Frameworks:** +- **Nginx Proxy Manager (NPM)** v2.x - Reverse proxy, SSL (10.5.0.1:80/443/81) +- **Dockge** - Docker Compose stack management UI (10.5.0.10:5001) +- **Forgejo** v10.0.1 - Self-hosted Git server (10.5.0.14:3000) +- **Technitium DNS** - DNS server with API (10.5.0.2:5380) + +**Monitoring & Observability:** +- **Uptime Kuma** - Service/endpoint monitoring (10.5.0.10:3001) +- **Beszel** - Server metrics dashboard (10.5.0.10:8090) + +**Messaging:** +- **Stalwart Mail Server** - Mail server (VM 200, IP 65.108.14.164) +- **Snappymail** - Webmail UI (djmaze/snappymail:latest, 10.5.0.10:8888) + +**Data Storage:** +- **DragonflyDB** - Redis-compatible in-memory datastore (10.5.0.10:6379) + - Password protected, used for 
session/cache storage +- **PostgreSQL 13+** (VMID 103, 10.5.0.109) - Community managed database +- **Redis/DragonflyDB** (VMID 104, 10.5.0.111) - Session/cache store + +## Key Dependencies + +**Python Packages (in ~/venv/):** + +**Proxmox API:** +- `proxmoxer` v2.2.0 - Python API client for Proxmox VE + - File: `~/bin/pve` (list, status, start, stop, create-ct operations) + +**Monitoring APIs:** +- `uptime-kuma-api` v1.2.1 - Uptime Kuma monitoring client + - File: `~/bin/kuma` (monitor management) +- `pocketbase` v0.15.0 - Beszel dashboard backend client + - File: `~/bin/beszel` (system monitoring) + +**Communications:** +- `python-telegram-bot` v22.5 - Telegram Bot API + - File: `~/telegram/bot.py` (homelab management bot) + +**HTTP Clients:** +- `requests` v2.32.5 - HTTP library for API calls +- `httpx` v0.28.1 - Async HTTP client +- `urllib3` v2.6.3 - Low-level HTTP client + +**Networking & WebSockets:** +- `websocket-client` v1.9.0 - WebSocket client library +- `python-socketio` v5.16.0 - Socket.IO client +- `simple-websocket` v1.1.0 - WebSocket utilities + +**Utilities:** +- `certifi` v2026.1.4 - SSL certificate verification +- `charset-normalizer` v3.4.4 - Character encoding detection +- `packaging` v25.0 - Version/requirement parsing + +## Configuration + +**Environment:** +- **Bash scripts:** Load credentials from `~/.config/{service}/credentials` files + - `~/.config/pve/credentials` - Proxmox API token + - `~/.config/dns/credentials` - Technitium DNS API + - `~/.config/beszel/credentials` - Beszel dashboard API + - `~/.config/uptime-kuma/credentials` - Uptime Kuma API + - `~/.config/forgejo/credentials` - Forgejo Git API +- **Python scripts:** Similar credential loading pattern +- **Telegram bot:** `~/telegram/credentials` file with `TELEGRAM_BOT_TOKEN` + +**Build & Runtime Configuration:** +- Python venv activation: `source ~/venv/bin/activate` +- Helper scripts use shebang: `#!/home/mikkel/venv/bin/python3` or `#!/bin/bash` +- All scripts in `~/bin/` 
are executable and PATH-accessible + +**Documentation:** +- `CLAUDE.md` - Development environment guidance +- `homelab-documentation.md` - Infrastructure reference (22KB, comprehensive) +- `README.md` - Quick container/service overview +- `TODO.md` - Pending maintenance tasks + +## Platform Requirements + +**Development/Management:** +- **Container:** LXC on Proxmox VE (VMID 102, "mgmt") +- **OS:** Debian-based Linux (venv requires Linux filesystem) +- **User:** mikkel (UID 1000, group georgsen GID 1000) +- **SSH:** Pre-installed keys for accessing other containers/VMs +- **Network:** Tailscale VPN for external access, internal vmbr1 (10.5.0.0/24) + +**Production (Core Server):** +- **Provider:** Hetzner AX52 (Helsinki) +- **CPU:** AMD Ryzen 7 3700X +- **RAM:** 64GB ECC +- **Storage:** 2x 1TB NVMe (RAID0 via ZFS) +- **Public IP:** 65.108.14.165/26 (BGP routed) +- **Network bridges:** vmbr0 (public), vmbr1 (internal), vmbr2 (vSwitch) + +**Backup Target:** +- **Synology NAS** (home network via Tailscale) +- **Protocol:** CIFS/SMB 3.0 over Tailscale +- **Mount point on PBS:** `/mnt/synology` (bind-mounted as datastore) + +## Deployment & Access + +**Service URLs:** +- **Proxmox Web UI:** https://65.108.14.165:8006 (public, home IP whitelisted) +- **NPM Admin:** http://10.5.0.1:81 (internal only) +- **DNS Admin:** https://dns.georgsen.dk (home IP whitelisted via access list) +- **PBS Web UI:** https://pbs.georgsen.dk:8007 (home IP whitelisted) +- **Dockge Admin:** https://dockge.georgsen.dk:5001 (home IP whitelisted) +- **Forgejo:** https://git.georgsen.dk (public) +- **Status Page:** https://status.georgsen.dk (Uptime Kuma) +- **Dashboard:** https://dashboard.georgsen.dk (Beszel metrics) + +**SSL Certificates:** +- **Provider:** Let's Encrypt via NPM +- **Challenge method:** HTTP-01 +- **Auto-renewal:** Handled by NPM + +--- + +*Stack analysis: 2026-02-04* diff --git a/.planning/codebase/STRUCTURE.md b/.planning/codebase/STRUCTURE.md new file mode 100644 index 
0000000..bd6c255 --- /dev/null +++ b/.planning/codebase/STRUCTURE.md @@ -0,0 +1,228 @@ +# Codebase Structure + +**Analysis Date:** 2026-02-04 + +## Directory Layout + +``` +/home/mikkel/homelab/ +├── .planning/ # Planning and analysis artifacts +│ └── codebase/ # Codebase documentation (ARCHITECTURE.md, STRUCTURE.md, etc.) +├── .git/ # Git repository metadata +├── telegram/ # Telegram bot and message storage +│ ├── bot.py # Main bot implementation +│ ├── credentials # Telegram bot token (env var: TELEGRAM_BOT_TOKEN) +│ ├── authorized_users # Allowlist of chat IDs (one per line) +│ ├── inbox # Messages from admin (appended on each message) +│ ├── images/ # Photos sent via Telegram (timestamped) +│ └── files/ # Files sent via Telegram (timestamped) +├── pve-homelab-kit/ # PVE installation kit (subproject) +│ ├── install.sh # Installation script +│ ├── PROMPT.md # Project context for Claude +│ ├── .planning/ # Subproject planning docs +│ └── README.md # Setup instructions +├── npm/ # Nginx Proxy Manager configuration +│ └── npm-api.conf # API credentials reference +├── dockge/ # Docker Compose Manager configuration +│ └── credentials # Dockge API access +├── dns/ # Technitium DNS configuration +│ └── credentials # DNS API credentials (env vars: DNS_HOST, DNS_PORT, DNS_USER, DNS_PASS) +├── dns-services/ # DNS services configuration +│ └── credentials # Alternative DNS credentials +├── pve/ # Proxmox VE configuration +│ └── credentials # PVE API credentials (env vars: host, user, token_name, token_value) +├── beszel/ # Beszel monitoring dashboard +│ ├── credentials # Beszel API credentials +│ └── README.md # API and agent setup guide +├── forgejo/ # Forgejo Git server configuration +│ └── credentials # Forgejo API access +├── uptime-kuma/ # Uptime Kuma monitoring +│ ├── credentials # Kuma API credentials (env vars: KUMA_HOST, KUMA_PORT, KUMA_API_KEY) +│ ├── README.md # REST API reference and Socket.IO documentation +│ └── kuma_api_doc.png # Full API documentation 
screenshot +├── README.md # Repository overview and service table +├── CLAUDE.md # Claude Code guidance and infrastructure quick reference +├── homelab-documentation.md # Authoritative infrastructure documentation +├── TODO.md # Pending maintenance tasks +└── .gitignore # Git ignore patterns (credentials, sensitive files) +``` + +## Directory Purposes + +**telegram/:** +- Purpose: Two-way Telegram bot for management commands and admin notifications +- Contains: Python bot code, token credentials, authorized user allowlist, message inbox, uploaded media +- Key files: `bot.py` (407 lines), `credentials`, `authorized_users`, `inbox` +- Not committed: `credentials`, `inbox`, `images/*`, `files/*` (in `.gitignore`) + +**pve-homelab-kit/:** +- Purpose: Standalone PVE installation and initial setup toolkit +- Contains: Installation script, configuration examples, planning documents +- Key files: `install.sh` (executable automation), `PROMPT.md` (context for Claude), subproject `.planning/` +- Notes: Separate git repository (submodule or independent), for initial PVE deployment + +**npm/:** +- Purpose: Nginx Proxy Manager reverse proxy configuration +- Contains: API credentials reference +- Key files: `npm-api.conf` + +**dns/ & dns-services/:** +- Purpose: Technitium DNS server configuration (dual credential sets) +- Contains: API authentication credentials +- Key files: `credentials` (host, port, user, password) + +**pve/:** +- Purpose: Proxmox VE API access credentials +- Contains: Token-based authentication data +- Key files: `credentials` (host, user, token_name, token_value) + +**dockge/, forgejo/, beszel/, uptime-kuma/:** +- Purpose: Service-specific API credentials and documentation +- Contains: Token/API key for each service +- Key files: `credentials`, service-specific `README.md` (beszel, uptime-kuma) + +**homelab-documentation.md:** +- Purpose: Authoritative reference for all infrastructure details +- Contains: Network topology, VM/container registry, service 
mappings, security rules, firewall config +- Must be updated whenever: services added/removed, IPs changed, configurations modified + +**CLAUDE.md:** +- Purpose: Claude Code (AI assistant) guidance and quick reference +- Contains: Environment setup, helper script signatures, API access patterns, security notes +- Auto-loaded by Claude when working in this repository + +**.planning/codebase/:** +- Purpose: GSD codebase analysis artifacts +- Will contain: ARCHITECTURE.md, STRUCTURE.md, CONVENTIONS.md, TESTING.md, STACK.md, INTEGRATIONS.md, CONCERNS.md +- Generated by: GSD codebase mapper, consumed by GSD planner/executor + +## Key File Locations + +**Entry Points:** +- `telegram/bot.py`: Telegram bot entry point (asyncio-based) +- `pve-homelab-kit/install.sh`: Initial PVE setup entry point + +**Configuration:** +- `homelab-documentation.md`: Infrastructure reference (IPs, ports, network topology, firewall rules) +- `CLAUDE.md`: Claude Code environment setup and quick reference +- `.planning/`: Planning and analysis artifacts + +**Core Logic:** +- `~/bin/pve`: Proxmox VE API wrapper (Python, 200 lines) +- `~/bin/dns`: Technitium DNS API wrapper (Bash, 107 lines) +- `~/bin/pbs`: PBS backup status and management (Bash, 400+ lines) +- `~/bin/beszel`: Beszel monitoring dashboard API (Bash/Python, 137 lines) +- `~/bin/kuma`: Uptime Kuma monitor management (Bash, 144 lines) +- `~/bin/updates`: Service version checking and updates (Bash, 450+ lines) +- `~/bin/telegram`: CLI helper for Telegram bot control (2-way messaging) +- `~/bin/npm-api`: NPM reverse proxy management (wrapper script) +- `telegram/bot.py`: Telegram bot with command handlers and media management + +**Testing:** +- Not applicable (no automated tests in this repository) + +## Naming Conventions + +**Files:** +- Lowercase with hyphens for multi-word names: `npm-api`, `uptime-kuma`, `pve-homelab-kit` +- Markdown documentation: UPPERCASE.md (`README.md`, `CLAUDE.md`, `homelab-documentation.md`) +- 
Configuration/credential files: lowercase `credentials` with optional zone prefix + +**Directories:** +- Service-specific: lowercase, match service name (`npm`, `dns`, `dockge`, `forgejo`, `beszel`, `telegram`) +- Functional: category name (`pve`, `pve-homelab-kit`) +- Hidden: `.planning`, `.git` for system metadata + +**Variables & Parameters:** +- Environment variables: UPPERCASE_WITH_UNDERSCORES (e.g., `TELEGRAM_BOT_TOKEN`, `DNS_HOST`, `KUMA_API_KEY`) +- Bash functions: lowercase_with_underscores (e.g., `get_token()`, `run_command()`, `ssh_pbs()`) +- Python functions: lowercase_with_underscores (e.g., `is_authorized()`, `run_command()`, `get_status()`) + +## Where to Add New Code + +**New Helper Script (CLI tool):** +- Primary code: `~/bin/{service_name}` (no extension, executable) +- Credentials: `~/.config/{service_name}/credentials` +- Documentation: Top-of-file comment with usage examples +- Language: Bash for shell commands/APIs, Python for complex logic (use Python venv) + +**New Service Configuration:** +- Directory: `/home/mikkel/homelab/{service_name}/` +- Credentials file: `{service_name}/credentials` +- Documentation: `{service_name}/README.md` (include API examples and setup) +- Git handling: All credentials in `.gitignore`, document as `credentials.example` if needed + +**New Telegram Bot Command:** +- File: `telegram/bot.py` (add function to existing handlers section) +- Pattern: Async function named `cmd_name()`, check authorization first with `is_authorized()` +- Result: Send back via `update.message.reply_text()` +- Timeout: Default 30 seconds (configurable via `run_command()`) + +**New Documentation:** +- Infrastructure changes: Update `homelab-documentation.md` (IPs, service registry, network config) +- Claude Code guidance: Update `CLAUDE.md` (new helper scripts, environment setup) +- Service-specific: Create `{service_name}/README.md` with API examples and access patterns + +**Shared Utilities:** +- Location: Create in `~/lib/` or 
`~/venv/lib/` for Python packages +- Access: Import in other scripts or source in Bash + +## Special Directories + +**.planning/codebase/:** +- Purpose: GSD analysis artifacts +- Generated: Yes (by GSD codebase mapper) +- Committed: Yes (part of repository for reference) + +**telegram/images/ & telegram/files/:** +- Purpose: Media uploaded via Telegram bot +- Generated: Yes (bot downloads on receipt) +- Committed: No (in `.gitignore`) + +**telegram/inbox:** +- Purpose: Admin messages to Claude +- Generated: Yes (bot appends messages) +- Committed: No (in `.gitignore`) + +**.git/** +- Purpose: Git repository metadata +- Generated: Yes (by git) +- Committed: No (system directory) + +**pve-homelab-kit/.planning/** +- Purpose: Subproject planning documents +- Generated: Yes (by GSD mapper on subproject) +- Committed: Yes (tracked in subproject) + +## Credential File Organization + +All credentials stored in `~/.config/{service}/credentials` using key=value format (one per line): + +```bash +# ~/.config/pve/credentials +host=core.georgsen.dk +user=root@pam +token_name=automation +token_value= + +# ~/.config/dns/credentials +DNS_HOST=10.5.0.2 +DNS_PORT=5380 +DNS_USER=admin +DNS_PASS= + +# ~/.config/beszel/credentials +BESZEL_HOST=10.5.0.10 +BESZEL_PORT=8090 +BESZEL_USER= +BESZEL_PASS= +``` + +**Loading Pattern:** +- Bash: `source ~/.config/{service}/credentials` or inline `$(cat ~/.config/{service}/credentials | grep ^KEY= | cut -d= -f2-)` +- Python: Read file, parse `key=value` lines into dict +- Never hardcode credentials in scripts + +--- + +*Structure analysis: 2026-02-04* diff --git a/.planning/codebase/TESTING.md b/.planning/codebase/TESTING.md new file mode 100644 index 0000000..7cf6e68 --- /dev/null +++ b/.planning/codebase/TESTING.md @@ -0,0 +1,324 @@ +# Testing Patterns + +**Analysis Date:** 2026-02-04 + +## Test Framework + +**Current State:** +- **No automated testing detected** in this codebase +- No test files found (no `*.test.py`, `*_test.py`, `*.spec.py` 
files) +- No testing configuration files (no `pytest.ini`, `tox.ini`, `setup.cfg`) +- No test dependencies in requirements (no pytest, unittest, mock imports) + +**Implications:** +This is a **scripts-only codebase** - all code consists of CLI helper scripts and one bot automation. Manual testing is the primary validation method. + +## Script Testing Approach + +Since this codebase consists entirely of helper scripts and automation, testing is manual and implicit: + +**Command-Line Validation:** +- Each script has a usage/help message showing all commands +- Example from `pve`: + ```python + if len(sys.argv) < 2: + print(__doc__) + sys.exit(1) + ``` +- Example from `telegram`: + ```bash + case "${1:-}" in + send) cmd_send "$2" ;; + inbox) cmd_inbox ;; + *) usage; exit 1 ;; + esac + ``` + +**Entry Point Testing:** +Main execution guards are used throughout: +```python +if __name__ == "__main__": + main() +``` + +This allows scripts to be imported (theoretically) without side effects, though in practice they are not used as modules. + +## API Integration Testing + +**Pattern: Try-Except Fallback:** +Many scripts handle multiple service types by trying different approaches: + +From `pve` script (lines 55-85): +```python +def get_status(vmid): + """Get detailed status of a VM/container.""" + vmid = int(vmid) + # Try as container first + try: + status = pve.nodes(NODE).lxc(vmid).status.current.get() + # ... container-specific logic + return + except: + pass + + # Try as VM + try: + status = pve.nodes(NODE).qemu(vmid).status.current.get() + # ... VM-specific logic + return + except: + pass + + print(f"VMID {vmid} not found") +``` + +This is a pragmatic testing pattern: if one API call fails, try another. Useful for development but fragile without structured error handling. 
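The fallback shown above swallows every exception, so a typo or a network failure looks the same as "wrong guest type". One way to keep the try-one-then-the-other flow while treating only "not found" as a reason to fall through is sketched below. `NotFound`, `first_match()`, and the two fake lookups are illustrative names, not part of the `pve` script or of proxmoxer; the real script would need to translate its client's error into such a narrow exception.

```python
class NotFound(Exception):
    """Hypothetical error meaning 'this VMID is not of this guest type'."""


def first_match(vmid, lookups):
    """Return (kind, status) from the first lookup that does not raise NotFound.

    lookups: list of (kind, callable) pairs, tried in order.
    Any other exception propagates so real failures stay visible.
    """
    for kind, lookup in lookups:
        try:
            return kind, lookup(vmid)
        except NotFound:
            continue
    return None, None


# Stand-ins for the container and VM API calls (made-up data, for illustration only).
def fake_lxc(vmid):
    if vmid != 102:
        raise NotFound(vmid)
    return {"status": "running"}


def fake_qemu(vmid):
    if vmid != 200:
        raise NotFound(vmid)
    return {"status": "stopped"}


LOOKUPS = [("lxc", fake_lxc), ("qemu", fake_qemu)]
```

With this shape, `first_match(999, LOOKUPS)` yields `(None, None)` and the caller can print the "not found" message, while an unexpected error still produces a traceback.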
+ +## Command Dispatch Testing + +**Pattern: Argument Validation:** +All scripts validate argument count before executing commands: + +From `beszel` script (lines 101-124): +```python +if __name__ == "__main__": + if len(sys.argv) < 2: + usage() + + cmd = sys.argv[1] + + try: + if cmd == "list": + cmd_list() + elif cmd == "info" and len(sys.argv) == 3: + cmd_info(sys.argv[2]) + elif cmd == "add" and len(sys.argv) >= 4: + # ... + else: + usage() + except Exception as e: + print(f"Error: {e}") + sys.exit(1) +``` + +This catches typos in command names and wrong argument counts, showing usage help. + +## Data Processing Testing + +**Bash String Parsing:** +Complex regex patterns used in `pbs` script require careful testing: + +From `pbs` (lines 122-143): +```bash +ssh_pbs 'tail -500 /var/log/proxmox-backup/tasks/archive 2>/dev/null' | while IFS= read -r line; do + if [[ "$line" =~ UPID:pbs:[^:]+:[^:]+:[^:]+:([0-9A-Fa-f]+):([^:]+):([^:]+):.*\ [0-9A-Fa-f]+\ (OK|ERROR|WARNINGS[^$]*) ]]; then + task_time=$((16#${BASH_REMATCH[1]})) + task_type="${BASH_REMATCH[2]}" + task_target="${BASH_REMATCH[3]}" + status="${BASH_REMATCH[4]}" + # ... 
process matched groups + fi +done +``` + +**Manual Testing Approach:** +- Run command against live services +- Inspect output format visually +- Verify JSON parsing with inline Python: + ```bash + echo "$gc_json" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('disk-bytes',0))" + ``` + +## Mock Testing Pattern (Telegram Bot) + +The telegram bot has one pattern that resembles mocking - subprocess mocking via `run_command()`: + +From `telegram/bot.py` (lines 60-78): +```python +def run_command(cmd: list, timeout: int = 30) -> str: + """Run a shell command and return output.""" + try: + result = subprocess.run( + cmd, + capture_output=True, + text=True, + timeout=timeout, + env={**os.environ, 'PATH': f"/home/mikkel/bin:{os.environ.get('PATH', '')}"} + ) + output = result.stdout or result.stderr or "No output" + # Telegram has 4096 char limit per message + if len(output) > 4000: + output = output[:4000] + "\n... (truncated)" + return output + except subprocess.TimeoutExpired: + return "Command timed out" + except Exception as e: + return f"Error: {e}" +``` + +This function: +- Runs external commands with timeout protection +- Handles both stdout and stderr +- Truncates output for Telegram's message size limits +- Returns error messages instead of raising exceptions + +This enables testing command handlers by mocking which commands are available. + +## Timeout Testing + +The telegram bot handles timeouts explicitly: + +From `telegram/bot.py`: +```python +result = subprocess.run( + ["ping", "-c", "3", "-W", "2", host], + capture_output=True, + text=True, + timeout=10 # 10 second timeout +) +``` + +Different commands have different timeouts: +- `ping_host()`: 10 second timeout +- `run_command()`: 30 second default (configurable) +- `backups()`: 60 second timeout (passed to run_command) + +This prevents the bot from hanging on slow/unresponsive services. 
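The timeout and truncation branches of `run_command()` could be exercised without running real commands by patching `subprocess.run`. This is a rough sketch under stated assumptions: the function body is a simplified copy of the excerpt above (the `PATH` override is omitted), and the two `check_*` helpers are hypothetical names, not part of the bot.

```python
import subprocess
from unittest import mock


def run_command(cmd: list, timeout: int = 30) -> str:
    """Simplified copy of the bot's helper (PATH override omitted)."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
        output = result.stdout or result.stderr or "No output"
        # Keep the reply under Telegram's 4096-character message limit.
        if len(output) > 4000:
            output = output[:4000] + "\n... (truncated)"
        return output
    except subprocess.TimeoutExpired:
        return "Command timed out"
    except Exception as e:
        return f"Error: {e}"


def check_timeout_is_reported() -> str:
    """Simulate a hung command by making subprocess.run raise TimeoutExpired."""
    timeout_error = subprocess.TimeoutExpired(cmd=["pbs"], timeout=60)
    with mock.patch("subprocess.run", side_effect=timeout_error):
        return run_command(["pbs"], timeout=60)


def check_long_output_is_truncated() -> str:
    """Simulate a command that prints 5000 characters to stdout."""
    fake = subprocess.CompletedProcess(
        args=["pve"], returncode=0, stdout="x" * 5000, stderr=""
    )
    with mock.patch("subprocess.run", return_value=fake):
        return run_command(["pve"])
```

Because `run_command()` returns strings instead of raising, each branch can be asserted on its return value alone, which keeps such tests independent of the live PVE and PBS services.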
+ +## Error Message Testing + +Scripts validate successful API responses: + +From `dns` script (lines 62-69): +```bash +curl -s "$BASE/zones/records/add?..." | python3 -c " +import sys, json +data = json.load(sys.stdin) +if data['status'] == 'ok': + print(f\"Added: {data['response']['addedRecord']['name']} -> ...\") +else: + print(f\"Error: {data.get('errorMessage', 'Unknown error')}\") +" +``` + +This pattern: +- Parses JSON response +- Checks status field +- Prints a user-friendly error message on failure + +## Credential Testing + +Scripts assume credentials exist and are properly formatted: + +From `pve` (lines 17-34): +```python +creds_path = Path.home() / ".config" / "pve" / "credentials" +creds = {} +with open(creds_path) as f: + for line in f: + if "=" in line: + key, value = line.strip().split("=", 1) + creds[key] = value + +pve = ProxmoxAPI( + creds["host"], + user=creds["user"], + token_name=creds["token_name"], + token_value=creds["token_value"], + verify_ssl=False +) +``` + +**Missing Error Handling:** +- No check that credentials file exists +- No check that required keys are present +- No validation that API connection succeeds +- Crashes with `FileNotFoundError` if the file is missing, or `KeyError` if a required key is absent + +**Recommendation for Testing:** +Add pre-flight validation: +```python +required_keys = ["host", "user", "token_name", "token_value"] +missing = [k for k in required_keys if k not in creds] +if missing: + print(f"Error: Missing credentials: {', '.join(missing)}") + sys.exit(1) +``` + +## File I/O Testing + +The Telegram bot handles file operations defensively: + +From `telegram/bot.py` (lines 277-286): +```python +# Create images directory +images_dir = Path(__file__).parent / 'images' +images_dir.mkdir(exist_ok=True) + +# Get the largest photo (best quality) +photo = update.message.photo[-1] +file = await context.bot.get_file(photo.file_id) + +# Download the image +filename = f"{file_timestamp}.jpg" +filepath = images_dir / filename +await 
file.download_to_drive(filepath) +``` + +**Patterns:** +- `mkdir(exist_ok=True)`: Safely creates directory, doesn't error if exists +- Timestamp-based filenames to avoid collisions: `f"{file_timestamp}_{original_name}"` +- Pathlib for cross-platform path handling + +## What to Test If Writing Tests + +If converting to automated tests, prioritize: + +**High Priority:** +1. **Telegram bot command dispatch** (`telegram/bot.py` lines 107-366) + - Each command handler should have unit tests + - Mock `subprocess.run()` to avoid calling actual commands + - Test authorization checks (`is_authorized()`) + - Test output truncation for large responses + +2. **Credential loading** (all helper scripts) + - Test missing credentials file error + - Test malformed credentials + - Test missing required keys + +3. **API response parsing** (`dns`, `pbs`, `beszel`, `kuma`) + - Test JSON parsing errors + - Test malformed responses + - Test status code handling + +**Medium Priority:** +1. **Bash regex parsing** (`pbs` task/error log parsing) + - Test hex timestamp conversion + - Test status code extraction + - Test task target parsing with special characters + +2. **Timeout handling** (all `run_command()` calls) + - Test command timeout + - Test output truncation + - Test error message formatting + +**Low Priority:** +1. Integration tests with real services (kept in separate test suite) +2. Performance tests for large data sets + +## Current Test Coverage + +**Implicit Testing:** +- Manual CLI testing during development +- Live service testing (commands run against real PVE, PBS, DNS, etc.) +- User/admin interaction testing (Telegram bot testing via /start, /status, etc.) + +**Gap:** +- No regression testing +- No automated validation of API response formats +- No error case testing +- No refactoring safety net + +--- + +*Testing analysis: 2026-02-04*