# Codebase Concerns
**Analysis Date:** 2026-02-04

## Tech Debt
**IP Addressing Scheme Inconsistency:**

- Issue: Container IPs don't follow the VMID convention. NPM (VMID 100) is at .1, Dockge (VMID 101) at .10, PBS (VMID 106) at .6, instead of matching .100, .101, .106
- Files: `homelab-documentation.md` (lines 139-159)
- Impact: Manual IP tracking required, DNS records must be maintained separately, new containers require manual IP assignment planning, documentation drift risk
- Fix approach: Execute the TODO task to reorganize vmbr1 to the VMID=IP scheme (.100-.253 range), then update NPM proxy hosts, DNS records (lab.georgsen.dk), and documentation

**Manual DNS Record Maintenance:**

- Issue: Internal DNS (Technitium) and external DNS (dns.services) require manual updates when IPs or domains change
- Files: `homelab-documentation.md` (lines 432-449), `~/bin/dns` script
- Impact: Risk of records becoming stale after IP migrations; no automation for new containers
- Fix approach: Implement the `dns-services` helper script (TODO.md line 27) with API integration for automatic updates

**Unimplemented Helper Scripts:**

- Issue: The `dns-services` API integration is promised in TODO but not implemented
- Files: `TODO.md` (line 27); `dns-services/credentials` exists, but the script itself does not
- Impact: Manual dns.services operations required; domain setup cannot be automated
- Fix approach: Create a `~/bin/dns-services` wrapper (endpoint documented in TODO)

**Ping Capability Missing on Unprivileged Containers:**

- Issue: Unprivileged LXC containers drop cap_net_raw, breaking ping on VMIDs 100, 101, 102, 103, 104, 105, 107, 108, 110, 111, 112, 114, 115, 1000
- Files: `TODO.md` (lines 31-33), `CLAUDE.md` (lines 252-255)
- Impact: Health monitoring fails, network diagnostics are broken, Telegram bot status checks are incomplete (the bot itself cannot ping the home network), and Uptime Kuma monitors may show false negatives
- Fix approach: Run `setcap cap_net_raw+ep /bin/ping` in each container (must be reapplied after iputils-ping updates)

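From the Proxmox host the fix can be applied to every affected container in one loop. The sketch below only prints the `pct exec` commands so it can be reviewed (and run anywhere) before executing for real; the VMID list mirrors the issue above:

```shell
# VMIDs with broken ping, copied from TODO.md; adjust to your environment.
VMIDS="100 101 102 103 104 105 107 108 110 111 112 114 115 1000"

for vmid in $VMIDS; do
  # Dry run: echo the command instead of executing it. Drop the `echo`
  # on the Proxmox host to actually reapply the capability.
  echo "pct exec $vmid -- setcap cap_net_raw+ep /bin/ping"
done
```

Remember this must be re-run whenever iputils-ping is upgraded inside a container.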
**Version Pinning Warnings:**

- Issue: CLAUDE.md (lines 227-241) warns that hardcoded versions go stale
- Files: `homelab-documentation.md` (lines 217, 228, 239); the `~/bin/updates` script implements version checking, but some configs still use `latest` tags
- Impact: Security patch delays, incompatibilities when manually deploying services
- Fix approach: Always query the GitHub API for latest versions (the updates script already does this correctly in its discovery phase)

## Known Bugs
**Telegram Bot Inbox Storage Race Condition:**

- Symptoms: Concurrent message writes could corrupt the inbox file; messages may be lost
- Files: `telegram/bot.py` (line 39, lines 200-220 message handling), `~/bin/telegram` (lines 73-79 clear command)
- Trigger: Multiple rapid messages from the admin, or concurrent bot operations
- Workaround: Clear the inbox frequently and check for corruption; the bot currently appends to the file without locking
- Root cause: File-based inbox with no atomic writes or mutex protection

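Until `bot.py` gains proper locking, the same idea can be demonstrated (and used from `~/bin/telegram`) with `flock(1)`: take an exclusive lock on a sidecar lock file around every append. Paths here are stand-ins for the real inbox:

```shell
INBOX=$(mktemp)   # stands in for the bot's real inbox file

append_message() {
  (
    flock -x 9                        # exclusive lock on fd 9; concurrent writers wait here
    printf '%s\n' "$1" >> "$INBOX"    # the append happens only while the lock is held
  ) 9>"$INBOX.lock"
}

append_message "message from writer 1"
append_message "message from writer 2"
cat "$INBOX"
```

The equivalent in Python is `fcntl.flock()` around the file write; either way the lock serializes writers so lines can no longer interleave.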
**PBS Backup Mount Dependency Not Enforced:**

- Symptoms: PBS services may start before the Synology CIFS mount is available, leaving the backup path unreachable
- Files: `homelab-documentation.md` (lines 372-384), container 106 config
- Trigger: System reboot while Tailscale connectivity is still coming up
- Workaround: Manually restart proxmox-backup-proxy and proxmox-backup
- Root cause: `After=mnt-synology.mount` only orders startup; it does not require the mount, so the services start even when mounting failed

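A systemd drop-in can make the dependency hard instead of merely ordered. This is a sketch with assumed unit and mount paths; verify the real mount unit name with `systemctl list-units -t mount` before adopting it:

```ini
# Hypothetical drop-in: /etc/systemd/system/proxmox-backup-proxy.service.d/mounts.conf
[Unit]
# RequiresMountsFor= adds both Requires= and After= on the mount unit,
# so the service is not started until the mount is up and fails cleanly
# (and can be retried) if mounting did not succeed.
RequiresMountsFor=/mnt/synology
```

The same drop-in would be needed for proxmox-backup.service; run `systemctl daemon-reload` after adding it.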
**DragonflyDB Password in Plain Text in Documentation:**

- Symptoms: Database password visible in the compose file and documentation
- Files: `homelab-documentation.md` (lines 248-250)
- Trigger: Anyone reading the docs or inspecting git history
- Workaround: Treat the password as low-risk only while the container is reachable solely on the internal network
- Root cause: Password stored in version control and documentation rather than in a .env or secrets file

**NPM Proxy Host 18 (mh.datalos.dk) Not Configured:**

- Symptoms: Domain does not resolve; the DNS record is missing and the NPM entry (ID 18) exists only as a TODO item
- Files: `TODO.md` (line 29), `homelab-documentation.md` (proxy hosts section)
- Trigger: Accessing mh.datalos.dk from a browser
- Workaround: Must be configured manually via the NPM web UI
- Root cause: Setup referenced in TODO but never completed

## Security Considerations
**Exposed Credentials in Git History:**

- Risk: Credential files committed (credentials, SSH keys, telegram token examples)
- Files: All credential files in `telegram/`, `pve/`, `forgejo/`, `dns/`, `dockge/`, `uptime-kuma/`, `beszel/`, `dns-services/` directories (8+ files)
- Current mitigation: Files are .gitignored in the main repo but present in the working directory
- Recommendations: Rotate all credentials listed, audit git log for historical commits, use HashiCorp Vault or pass for credential storage, document a secret rotation procedure

**Public IP Hardcoded in Documentation:**

- Risk: Home IP 83.89.248.247 exposed in multiple locations
- Files: `homelab-documentation.md` (lines 98, 102), `CLAUDE.md` (line 256)
- Current mitigation: IP is already public/static, used for whitelist access
- Recommendations: Document that whitelisting this IP is intentional, and that no other PII is mixed in

**Telegram Bot Authorization Model Too Permissive:**

- Risk: The first user to message the bot automatically becomes admin, with no verification
- Files: `telegram/bot.py` (lines 86-95)
- Current mitigation: The bot only responds to the authorized user, and an attacker would first have to discover the bot
- Recommendations: Require multi-factor authorization on first start (e.g., a PIN from an environment variable), and implement audit logging of all bot commands

**Database Credentials in Environment Variables:**

- Risk: DragonflyDB password passed via the Docker command line (visible in `docker ps`, logs, process listings)
- Files: `homelab-documentation.md` (line 248)
- Current mitigation: Container only accessible on the internal vmbr1 network
- Recommendations: Use Docker secrets or mounted .env files instead of command-line arguments

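Compose's file-based secrets keep the password off the command line. This is a sketch only: the service and secret names are illustrative, and whether DragonflyDB can read its password from a file rather than a flag needs to be verified against its documentation:

```yaml
# Sketch: mount the password as a file under /run/secrets instead of
# passing it as a command-line argument. Names are illustrative.
services:
  dragonfly:
    image: docker.dragonflydb.io/dragonflydb/dragonfly  # pin a version tag, not `latest`
    secrets:
      - dragonfly_password
secrets:
  dragonfly_password:
    file: ./secrets/dragonfly_password.txt   # chmod 0600, git-ignored
```

The secret file then never appears in `docker ps` output or process listings.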
**Synology CIFS Credentials in fstab:**

- Risk: SMB credentials stored in plaintext in the fstab file with mode 0644 (world-readable)
- Files: `homelab-documentation.md` (line 369)
- Current mitigation: Mounted on container-only network, requires PBS container access
- Recommendations: Use a credentials file with mode 0600, rotate credentials regularly, monitor file permissions

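The move to a 0600 credentials file can be sketched as follows. The share and mount point are taken from the docs, the username/password are placeholders, and a temp file stands in for the real one so the sketch is safe to run anywhere:

```shell
CRED=$(mktemp)          # stands in for e.g. /root/.smbcredentials-synology
chmod 0600 "$CRED"      # root-only, unlike the world-readable fstab

cat > "$CRED" <<'EOF'
username=pbs
password=example-not-the-real-one
EOF

# The fstab entry then references the file instead of embedding the password:
echo "//synology/pbs-backup /mnt/synology cifs credentials=$CRED,_netdev 0 0"
```

With this in place, rotating the password means editing one 0600 file rather than fstab.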
**SSH Keys Included in Documentation:**

- Risk: Public SSH keys hardcoded in CLAUDE.md setup examples
- Files: `CLAUDE.md` and `homelab-documentation.md` SSH key examples
- Current mitigation: Public keys only (not private), used for container access
- Recommendations: Rotate these keys if the documentation is ever exposed; don't include them in public repos

## Performance Bottlenecks
**Single NVMe Storage (RAID0) Without Local Redundancy:**

- Problem: Core server has 2x1TB NVMe in RAID0 (striped, no redundancy)
- Files: `homelab-documentation.md` (lines 17-24)
- Cause: Cost optimization for the Hetzner dedicated server
- Impact: Single drive failure = total data loss; database corruption risk from RAID0 stripe inconsistency
- Improvement path: (1) Ensure PBS backups run successfully to Synology, (2) Test the backup restore procedure monthly, (3) Plan an upgrade path if budget allows (3-way mirror or RAID1)

**Backup Dependency on Single Tailscale Gateway:**

- Problem: All PBS backups to Synology go through the Tailscale relay (10.5.0.134), a single point of failure
- Files: `homelab-documentation.md` (lines 317-427)
- Cause: Synology only accessible via the Tailscale network; the relay container is required
- Impact: Tailscale relay downtime = backup failure; no local backup option
- Improvement path: (1) Add a second Tailscale relay for redundancy, (2) Explore PBS direct SSH backup mode, (3) Monitor relay container health

**DNS Queries All Route Through Single Technitium Container:**

- Problem: All internal DNS (lab.georgsen.dk) goes through container 115, and DHCP defaults to this server
- Files: `homelab-documentation.md` (lines 309-315), container config
- Cause: Single container architecture
- Impact: DNS outage = network unreachable (containers can't resolve any hostnames)
- Improvement path: (1) Deploy a DNS replica on another container, (2) Configure DHCP to hand out multiple DNS servers, (3) Set an upstream DNS fallback

**Script Execution via Telegram Bot with Subprocess Timeout:**

- Problem: The bot runs helper scripts with a 30-second timeout; commands like the PBS backup query can exceed the limit
- Files: `telegram/bot.py` (lines 60-78, 191)
- Cause: Helper scripts perform remote SSH execution, and network latency varies
- Impact: Commands truncated mid-execution, incomplete status reports, timeouts on slow networks
- Improvement path: Increase the timeout selectively, implement command queuing, cache results for frequently-called commands

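Caching is the cheapest of the three improvements. A minimal sketch in shell (the idea ports directly to `bot.py`); the 60-second TTL and the cache key are assumptions:

```shell
CACHE_DIR=$(mktemp -d)

cached() {
  key=$CACHE_DIR/$1; shift
  # Serve the cached result if it is younger than 60 seconds,
  # otherwise run the real (slow) command and cache its output.
  if [ -f "$key" ] && [ $(( $(date +%s) - $(stat -c %Y "$key") )) -lt 60 ]; then
    cat "$key"
  else
    "$@" | tee "$key"
  fi
}

first=$(cached pbs-status echo "pbs status: OK")
second=$(cached pbs-status echo "recomputed (should not appear)")
echo "$first / $second"
```

The second call returns the cached text without re-running the command, so repeated bot queries stay well inside the 30-second budget.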
## Fragile Areas
**Installer Shell Script with Unimplemented Sections:**

- Files: `pve-homelab-kit/install.sh` (495+ lines with TODO comments)
- Why fragile: Multiple TODO placeholders indicate incomplete implementation; the wizard UI is done but ~30 implementation TODOs remain
- Safe modification: (1) Don't merge branches without running through a full install, (2) Test each section independently, (3) Add shell `set -e` error handling
- Test coverage: Script has no tests, no dry-run mode, no rollback capability

**Manual Container Configuration in LXC Config Files:**

- Files: `/etc/pve/lxc/*.conf` across the Proxmox host (not in the repo, not version controlled)
- Why fragile: Critical settings (features, ulimits, AppArmor) live outside version control; drift risk after manual fixes
- Safe modification: Keep backup copies in `homelab-documentation.md` (already done for PBS); automate via Terraform/Ansible if future containers are added
- Test coverage: Config changes are only tested on the live container (no staging env)

**Helper Scripts with Hardcoded IPs and Paths:**

- Files: `~/bin/updates` (lines 16-17, 130), `~/bin/pbs`, `~/bin/pve`, `~/bin/dns`
- Why fragile: DOCKGE_HOST, PVE_HOST hardcoded; if IPs change during migration, all scripts must be updated manually
- Safe modification: Extract to a config file (e.g., `/etc/homelab/config.sh` or environment variables)
- Test coverage: Scripts tested against live infrastructure only

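A shared config file removes the per-script hardcoding. Sketch below, with a temp file standing in for the proposed `/etc/homelab/config.sh` and placeholder values rather than the real IPs:

```shell
CONF=$(mktemp)          # stands in for /etc/homelab/config.sh

cat > "$CONF" <<'EOF'
# Central homelab config -- values are placeholders, not the real IPs.
DOCKGE_HOST=10.0.0.101
PVE_HOST=10.0.0.2
EOF

# Each helper script then sources the file instead of hardcoding hosts:
. "$CONF"
echo "Dockge at $DOCKGE_HOST, PVE at $PVE_HOST"
```

After a migration, one edit to the config file updates every script at once.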
**SSH-Based Container Access Without Key Verification:**

- Files: `~/bin/updates` (lines 115-131); scripts use the `-q` flag, which suppresses host-key warnings
- Why fragile: `ssh -q` silences warnings (including host-key warnings), so a changed host key can go unnoticed, weakening MITM protection; scripts also assume SSH keys are pre-installed
- Safe modification: Add `-o StrictHostKeyChecking=accept-new` to record keys on first connection and fail on changes; document the key distribution procedure
- Test coverage: SSH connectivity is assumed to be working

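A small wrapper keeps the call sites tidy. The sketch below only prints the command it would run, so it stays runnable anywhere; the host and remote command are placeholders:

```shell
ssh_safe() {
  # accept-new records unknown hosts on first contact but refuses changed
  # keys afterwards -- unlike -q, which merely hides the warning output.
  # Drop the `echo` to actually execute.
  echo ssh -o StrictHostKeyChecking=accept-new -o BatchMode=yes "$@"
}

CMD=$(ssh_safe root@10.0.0.101 uptime)   # placeholder host and command
echo "$CMD"
```

BatchMode=yes additionally makes the scripts fail fast instead of hanging on a password prompt when a key is missing.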
**Backup Monitoring Without Alerting on Failure:**

- Files: `~/bin/pbs`, `telegram/bot.py` (status command only, no automatic failure alerts)
- Why fragile: Failed backups are only visible if someone checks manually; backup completion is not monitored
- Safe modification: Add a systemd timer to check PBS status hourly and send a Telegram alert on failure
- Test coverage: Manual checks only

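The timer idea could look like the pair of units below. Everything here is an assumption to adapt: the unit names, the helper paths, and how `~/bin/telegram` is invoked to send an alert:

```ini
# /etc/systemd/system/pbs-check.service (hypothetical)
[Unit]
Description=Check PBS backup status and alert on failure

[Service]
Type=oneshot
# Assumed helper locations; wire in the real pbs and telegram scripts.
ExecStart=/bin/sh -c '/root/bin/pbs status || /root/bin/telegram "PBS status check failed"'

# /etc/systemd/system/pbs-check.timer (hypothetical)
[Unit]
Description=Hourly PBS status check

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now pbs-check.timer`; failures then surface in Telegram instead of waiting for a manual check.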
## Scaling Limits
**Container IP Space Exhaustion:**

- Current capacity: vmbr1 is a /24 (256 IPs, .0-.255), DHCP range .100-.200 (101 IPs available for DHCP), static IPs scattered
- Limit: After ~150 containers, IP fragmentation becomes difficult to manage; the DHCP range conflicts with static allocation
- Scaling path: (1) Implement the TODO IP scheme (VMID=IP), (2) Expand to a /23 (512 IPs) if more containers are needed, (3) Use vmbr2 (vSwitch) for a secondary network

**Backup Datastore Single Synology Volume:**

- Current capacity: Synology `pbs-backup` share of unknown size (not documented)
- Limit: Unknown when the share becomes full; no warning system implemented
- Scaling path: (1) Document share capacity in homelab-documentation.md, (2) Add usage monitoring to `beszel` or Uptime Kuma, (3) Plan expansion to a second NAS

**Dockge Stack Limit:**

- Current capacity: Dockge container 101 running ~8-10 stacks visible in documentation
- Limit: No documented resource constraints; may hit CPU/RAM limits on the Hetzner AX52 with more containers
- Scaling path: (1) Monitor Dockge resource usage via Beszel, (2) Profile Dragonfly memory usage, (3) Plan VM migration for heavy workloads

**DNS Query Throughput:**

- Current capacity: Single Technitium container handling all internal DNS
- Limit: Container CPU/RAM limits unknown; no QPS monitoring
- Scaling path: (1) Add a DNS replica, (2) Monitor query latency, (3) Profile Technitium logs for slow queries

## Dependencies at Risk
**Technitium DNS (Unmaintained Risk):**

- Risk: TechnitiumSoftware/DnsServer has an irregular commit history; last significant release early 2024
- Impact: Security fixes may be delayed; compatibility with newer Linux kernels unknown
- Migration plan: (1) Profile the Technitium features currently used, (2) Evaluate CoreDNS or Dnsmasq alternatives, (3) Plan a gradual migration with dual DNS

**DragonflyDB as Redis Replacement:**

- Risk: Dragonfly has a smaller ecosystem than Redis; breaking changes are possible in minor updates
- Impact: Applications expecting exact Redis behavior may fail; less community support for issues
- Migration plan: (1) Pin the Dragonfly version in the compose file (currently `latest`), (2) Test upgrades in a dev environment, (3) Document any API incompatibilities found

**Dockge (Single Maintainer Project):**

- Risk: Dockge is maintained by one developer (louislam); the bus factor is high
- Impact: If the maintainer loses interest, fixes and features stop; the homelab depends on their release schedule
- Migration plan: (1) Use Dockge for UI only; don't depend on it for production orchestration, (2) Keep docker-compose expertise on the team, (3) Consider Portainer as a fallback alternative

**Forgejo (Younger than Gitea):**

- Risk: Forgejo is a recent fork of Gitea; database schema changes are possible in patch versions
- Impact: Upgrades may require manual migrations; data loss risk if a migration fails
- Migration plan: (1) Test Forgejo upgrades on a backup copy first, (2) Document the upgrade procedure, (3) Keep Gitea as a fallback if Forgejo breaks

## Missing Critical Features
**No Automated Health Monitoring/Alerting:**

- Problem: Status checks exist (via Telegram bot, Uptime Kuma) but no automatic alerts when services fail
- Blocks: Cannot sleep soundly; must manually check status to detect outages
- Implementation path: (1) Add Uptime Kuma HTTP monitors for all public services, (2) Create a Telegram alert webhook, (3) Monitor PBS backup success daily

**No Automated Certificate Renewal Verification:**

- Problem: NPM handles Let's Encrypt renewal, but there is no monitoring for renewal failures
- Blocks: Certificates could expire silently, only to be discovered during service failures
- Implementation path: (1) Add Uptime Kuma alerts for HTTP 200 on https:// services, (2) Add a monthly certificate expiry check, (3) Set up renewal failure alerts

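The monthly expiry check needs only `openssl`. The sketch generates a throwaway self-signed certificate so it is runnable anywhere; against the live proxy you would fetch the certificate over TLS instead (the hostname in the comment is an example):

```shell
# Throwaway 30-day self-signed cert standing in for the live NPM certificate.
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/key.pem \
  -out /tmp/cert.pem -days 30 -subj "/CN=example.lab" 2>/dev/null

# Extract the expiry date and convert it to days remaining.
expiry=$(openssl x509 -enddate -noout -in /tmp/cert.pem | cut -d= -f2)
days_left=$(( ($(date -d "$expiry" +%s) - $(date +%s)) / 86400 ))
echo "certificate expires in $days_left days"

# For the real proxy, fetch the live cert instead:
#   openssl s_client -connect lab.georgsen.dk:443 -servername lab.georgsen.dk </dev/null \
#     | openssl x509 -enddate -noout
```

Wrapping this in a timer and alerting when `days_left` drops below, say, 14 closes the silent-expiry gap.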
**No Disaster Recovery Runbook:**

- Problem: Procedures for rescuing a locked-out server (Hetzner Rescue Mode) are not documented
- Blocks: If SSH access is lost, recovery is impossible without external procedures
- Implementation path: (1) Document Hetzner Rescue Mode recovery steps, (2) Create network reconfiguration backup procedures, (3) Test rescue mode monthly

**No Change Log / Audit Trail:**

- Problem: Infrastructure changes are not logged; drift from documentation occurs silently
- Blocks: Unknown who made changes, when, and why; cannot track config evolution
- Implementation path: (1) Require a git commit for all manual changes, (2) Create change notifications to Telegram, (3) Generate a weekly drift detection report

**No Secrets Management System:**

- Problem: Credentials scattered across plaintext files, git history, and documentation
- Blocks: Cannot safely share access with team members; no credential rotation capability
- Implementation path: (1) Deploy HashiCorp Vault or Vaultwarden, (2) Migrate all secrets to the vault, (3) Create credential rotation procedures

## Test Coverage Gaps
**PBS Backup Restore Not Tested:**

- What's not tested: Full restore procedures; assumed to work but never verified
- Files: `homelab-documentation.md` (lines 325-392), no restore test documented
- Risk: If a restore is ever needed, issues may only surface during an actual data loss emergency
- Priority: HIGH - Add a monthly restore test procedure (restore a single VM to a temporary location, verify data integrity)

**Network Failover Scenarios:**

- What's not tested: What happens if the Tailscale relay (1000) goes down, if the NPM container restarts, or if DNS returns SERVFAIL
- Files: No documented failure scenarios
- Risk: Unknown recovery time; applications may hang instead of failing gracefully
- Priority: HIGH - Document and test each service's failure mode

**Helper Script Error Handling:**

- What's not tested: Scripts facing SSH timeouts, unreachable hosts, malformed responses
- Files: `~/bin/updates`, `~/bin/pbs`, `~/bin/pve` (error handling exists but is not tested against failures)
- Risk: Silent failures could go unnoticed; incomplete output returned to the caller
- Priority: MEDIUM - Add error injection tests (mock SSH failures)

**Telegram Bot Commands Under Load:**

- What's not tested: Bot response when running concurrent commands, or when helper scripts time out
- Files: `telegram/bot.py` (no load tests, concurrency behavior unknown)
- Risk: Bot may hang or lose messages under heavy load
- Priority: MEDIUM - Add a load test with 10+ concurrent commands

**Container Migration (VMID IP Scheme Change):**

- What's not tested: Migration of 15+ containers to the new IP scheme; full rollback procedures
- Files: `TODO.md` (lines 5-15, planned but not executed)
- Risk: A single IP misconfiguration could take multiple services offline
- Priority: HIGH - Create a detailed migration runbook with rollback at each step before executing

---

*Concerns audit: 2026-02-04*