felt/docs/felt_capacity_model.py

337 lines
17 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

#!/usr/bin/env python3
"""
Felt Infrastructure Capacity Planning
Target server: Hetzner AX102 or similar
- 16c/32t (AMD Ryzen 9 7950X3D or similar)
- 64-128GB DDR5 RAM
- 2× 1TB NVMe (RAID1 or split)
- ~€100/mo
Architecture:
1. Core Server(s) - handles NATS, PostgreSQL, API, admin dashboard
2. Virtual Leaf Server(s) - runs virtual Leaf instances for free tier
3. Pro venues have physical Leaf nodes — Core just receives NATS sync
Key insight: Virtual Leafs are HEAVIER than Pro sync because they
run the full tournament engine. Pro Leafs run on-premise — Core
just stores their sync data.
"""
print("=" * 70)
print("FELT INFRASTRUCTURE CAPACITY PLANNING")
print("=" * 70)
print("""
┌─────────────────────────────────────────────────────────────────┐
│ ARCHITECTURE OVERVIEW │
│ │
│ FREE TIER (Virtual Leaf) PRO TIER (Physical Leaf) │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Go backend │ ← runs on │ Go backend │ ← runs on │
│ │ SQLite DB │ OUR server │ SQLite DB │ THEIR hw │
│ │ WebSocket │ │ NVMe storage │ │
│ │ NATS client │ │ NATS client │ │
│ └──────┬───────┘ └──────┬───────┘ │
│ │ full engine │ sync only │
│ ▼ ▼ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ CORE SERVER │ │
│ │ NATS JetStream │ PostgreSQL │ API │ Admin UI │ │
│ └────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
""")
# ================================================================
# VIRTUAL LEAF RESOURCE ESTIMATES
# ================================================================
print("=" * 70)
print("VIRTUAL LEAF (Free Tier) — Resource Estimates")
print("=" * 70)
print("""
Each Virtual Leaf runs:
- Go binary (tournament engine + HTTP server + WebSocket hub)
- SQLite database (venue's tournaments, players, results)
- NATS client (sync to Core when needed)
- WebSocket connections (operator UI + player mobile + display proxy)
Per Virtual Leaf process:
""")
# Memory estimates
go_binary_base = 30 # MB - Go runtime + loaded binary
sqlite_working_set = 10 # MB - SQLite in-memory cache (small venues)
websocket_per_conn = 0.05 # MB per WebSocket connection
avg_concurrent_ws = 20 # During a tournament: 1 operator + ~15 players + displays
ws_memory = websocket_per_conn * avg_concurrent_ws
nats_client = 5 # MB - NATS client overhead
http_buffers = 10 # MB - HTTP server buffers, template cache
signage_content = 5 # MB - Cached signage content bundles
total_per_vleaf_active = go_binary_base + sqlite_working_set + ws_memory + nats_client + http_buffers + signage_content
total_per_vleaf_idle = go_binary_base + 5 + nats_client # Idle: minimal working set
print(f" Component Active Idle")
print(f" ─────────────────────────────────────────────")
print(f" Go runtime + binary {go_binary_base:>5} MB {go_binary_base:>5} MB")
print(f" SQLite working set {sqlite_working_set:>5} MB {5:>5} MB")
print(f" WebSocket ({avg_concurrent_ws} conns) {ws_memory:>5.1f} MB {0:>5} MB")
print(f" NATS client {nats_client:>5} MB {nats_client:>5} MB")
print(f" HTTP buffers/cache {http_buffers:>5} MB {2:>5} MB")
print(f" Signage content cache {signage_content:>5} MB {0:>5} MB")
print(f" ─────────────────────────────────────────────")
print(f" TOTAL per Virtual Leaf {total_per_vleaf_active:>5.1f} MB {go_binary_base + 5 + nats_client:>5} MB")
# CPU estimates
print(f"""
CPU usage:
Idle (no tournament): ~0.01 cores (just NATS heartbeat)
Active tournament: ~0.05-0.1 cores (timer ticks, WS pushes)
Peak (level change): ~0.2 cores (burst: recalculate, push all)
Signage editor (AI): ~0.5 cores (burst, rare)
Disk usage per venue:
SQLite database: 1-50 MB (scales with history)
Signage content: 10-100 MB (images, templates)
Average per venue: ~50 MB
""")
# ================================================================
# PRO VENUE CORE LOAD
# ================================================================
print("=" * 70)
print("PRO VENUE — Core Server Load (sync only)")
print("=" * 70)
print("""
Pro venues run their own Leaf hardware. Core only handles:
- NATS JetStream: receive sync messages (tournament results,
player updates, financial data) — async, bursty, small payloads
- PostgreSQL: upsert synced data into venue's partition
- API: serve admin dashboard, player profiles, public venue page
- Reverse proxy config: Netbird routing for player/operator access
Per Pro venue on Core:
""")
nats_per_pro = 2 # MB - JetStream consumer state + buffers
pg_per_pro = 5 # MB - PostgreSQL working set per venue
api_per_pro = 1 # MB - Cached API responses
total_pro_core = nats_per_pro + pg_per_pro + api_per_pro
print(f" Component Memory")
print(f" ───────────────────────────────────")
print(f" NATS consumer state {nats_per_pro:>5} MB")
print(f" PostgreSQL working set {pg_per_pro:>5} MB")
print(f" API cache {api_per_pro:>5} MB")
print(f" ───────────────────────────────────")
print(f" TOTAL per Pro venue {total_pro_core:>5} MB")
print(f"""
CPU usage per Pro venue:
Idle (between syncs): ~0.001 cores (negligible)
Active sync burst: ~0.02 cores (deserialize + upsert)
API requests: ~0.01 cores (occasional dashboard/mobile)
Disk usage per venue (PostgreSQL):
Average: 10-200 MB (scales with history)
With years of data: up to 500 MB
""")
# ================================================================
# SERVER CAPACITY CALCULATIONS
# ================================================================
print("=" * 70)
print("SERVER CAPACITY — Hetzner 16c/32t, 64GB RAM, 2×1TB NVMe")
print("=" * 70)
total_ram_mb = 64 * 1024 # 64 GB
total_cores = 32 # threads
total_disk_gb = 2000 # 2× 1TB (RAID1 = 1TB usable, or split = 2TB)
server_cost = 100 # €/mo
# OS + base services overhead
os_overhead_ram = 2048 # MB
nats_server_ram = 512 # MB - NATS JetStream server
pg_server_ram = 4096 # MB - PostgreSQL shared buffers + overhead
netbird_ram = 256 # MB - Netbird controller
api_server_ram = 512 # MB - Core API + admin frontend
monitoring_ram = 512 # MB - Prometheus, logging
reserve_ram = 2048 # MB - headroom / burst
base_overhead = os_overhead_ram + nats_server_ram + pg_server_ram + netbird_ram + api_server_ram + monitoring_ram + reserve_ram
available_ram = total_ram_mb - base_overhead
os_cores = 1
nats_cores = 1
pg_cores = 2
api_cores = 1
base_cores = os_cores + nats_cores + pg_cores + api_cores
available_cores = total_cores - base_cores
print(f"\n Base infrastructure overhead:")
print(f" OS + system: {os_overhead_ram:>6} MB {os_cores} cores")
print(f" NATS JetStream: {nats_server_ram:>6} MB {nats_cores} cores")
print(f" PostgreSQL: {pg_server_ram:>6} MB {pg_cores} cores")
print(f" Netbird controller: {netbird_ram:>6} MB")
print(f" Core API + admin: {api_server_ram:>6} MB {api_cores} cores")
print(f" Monitoring: {monitoring_ram:>6} MB")
print(f" Reserve/headroom: {reserve_ram:>6} MB")
print(f" ─────────────────────────────────────────")
print(f" Total overhead: {base_overhead:>6} MB {base_cores} cores")
print(f" Available for venues: {available_ram:>6} MB {available_cores} cores")
print(f" Available disk: ~{total_disk_gb // 2} GB (RAID1) or ~{total_disk_gb} GB (split)")
# ================================================================
# SCENARIO: DEDICATED VIRTUAL LEAF SERVER
# ================================================================
print(f"\n{'' * 70}")
print(f"SCENARIO A: Dedicated Virtual Leaf Server (free tier only)")
print(f"{'' * 70}")
# Key insight: most virtual Leafs are IDLE most of the time
# A venue running 3 tournaments/week has active tournaments maybe
# 12 hours/week = 7% of the time
active_ratio = 0.10 # 10% of virtual Leafs active at any given time
# Memory: need to keep all Leafs loaded (Go processes)
# but SQLite working sets and WS connections only for active ones
mem_per_vleaf = total_per_vleaf_idle # Base: all idle
mem_per_active_delta = total_per_vleaf_active - total_per_vleaf_idle # Extra when active
# CPU: only active Leafs consume meaningful CPU
cpu_per_active = 0.1 # cores per active tournament
# Solve for max Virtual Leafs (memory-bound)
# N * idle_mem + N * active_ratio * active_delta <= available_ram
# N * (idle_mem + active_ratio * active_delta) <= available_ram
effective_mem_per_vleaf = mem_per_vleaf + active_ratio * mem_per_active_delta
max_vleaf_by_ram = int(available_ram / effective_mem_per_vleaf)
# Check CPU bound
max_active = int(available_cores / cpu_per_active)
max_vleaf_by_cpu = int(max_active / active_ratio)
max_vleaf = min(max_vleaf_by_ram, max_vleaf_by_cpu)
print(f"\n Assumptions:")
print(f" Active ratio (tournaments running): {active_ratio:.0%}")
print(f" Memory per Leaf (idle): {mem_per_vleaf:.0f} MB")
print(f" Memory per Leaf (active delta): {mem_per_active_delta:.0f} MB")
print(f" Effective memory per Leaf: {effective_mem_per_vleaf:.0f} MB")
print(f" CPU per active tournament: {cpu_per_active} cores")
print(f"")
print(f" Capacity (RAM-limited): {max_vleaf_by_ram} Virtual Leafs")
print(f" Capacity (CPU-limited): {max_vleaf_by_cpu} Virtual Leafs")
print(f" ═══════════════════════════════════════════")
print(f" PRACTICAL CAPACITY: ~{max_vleaf} Virtual Leafs per server")
print(f" At {active_ratio:.0%} active: ~{int(max_vleaf * active_ratio)} concurrent tournaments")
print(f" Server cost: €{server_cost}/mo")
print(f" Cost per Virtual Leaf: €{server_cost/max_vleaf:.2f}/mo")
# Disk check
avg_disk_per_vleaf = 50 # MB
total_disk_used = max_vleaf * avg_disk_per_vleaf / 1024
print(f" Disk usage ({max_vleaf} venues): ~{total_disk_used:.0f} GB (well within {total_disk_gb//2} GB)")
# ================================================================
# SCENARIO: CORE SERVER (Pro venues + API + services)
# ================================================================
print(f"\n{'' * 70}")
print(f"SCENARIO B: Core Server (Pro venue sync + API + all services)")
print(f"{'' * 70}")
max_pro_by_ram = int(available_ram / total_pro_core)
max_pro_by_cpu = int(available_cores / 0.03) # ~0.03 cores average per Pro venue
max_pro = min(max_pro_by_ram, max_pro_by_cpu)
print(f"\n Per Pro venue on Core: {total_pro_core} MB RAM, ~0.03 cores avg")
print(f" Capacity (RAM-limited): {max_pro_by_ram} Pro venues")
print(f" Capacity (CPU-limited): {max_pro_by_cpu} Pro venues")
print(f" ═══════════════════════════════════════════")
print(f" PRACTICAL CAPACITY: ~{min(max_pro, 5000)} Pro venues per server")
print(f" (PostgreSQL becomes the bottleneck before RAM/CPU)")
print(f" Server cost: €{server_cost}/mo")
print(f" Cost per Pro venue: €{server_cost/min(max_pro,2000):.2f}/mo")
# ================================================================
# COMBINED: REALISTIC DEPLOYMENT TOPOLOGY
# ================================================================
print(f"\n{'=' * 70}")
print(f"RECOMMENDED DEPLOYMENT TOPOLOGY")
print(f"{'=' * 70}")
print(f"""
┌─────────────────────────────────────────────────────────────┐
│ PHASE 1: Single Server (~€100/mo) │
│ │
│ One Hetzner box runs EVERYTHING: │
│ Core (NATS + PostgreSQL + API) + Virtual Leafs │
│ │
│ Capacity: ~300-400 Virtual Leafs + ~200 Pro venues │
│ This gets you through Year 1-2 easily. │
│ │
│ Actual cost per venue: │
│ If 200 free + 20 Pro: €100 / 220 = €0.45/venue/mo │
│ Way under the €4/mo we budgeted per free venue! │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ PHASE 2: Split (2 servers, ~€200/mo) │
│ │
│ Server 1: Core (NATS + PG + API + Netbird controller) │
│ Server 2: Virtual Leaf farm │
│ │
│ Capacity: ~{max_vleaf} Virtual Leafs + ~2000 Pro venues │
│ Split when you hit ~400 free venues or need more RAM. │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ PHASE 3: Regional (3+ servers, ~€300+/mo) │
│ │
│ EU: Core + VLeaf farm (Hetzner Falkenstein) │
│ US: Core replica + VLeaf farm (Hetzner Ashburn) │
│ APAC: Core replica + VLeaf farm (OVH Singapore) │
│ │
│ NATS super-cluster for cross-region sync │
│ Player profiles: globally replicated │
│ Venue data: stays in region (GDPR) │
└─────────────────────────────────────────────────────────────┘
""")
# ================================================================
# THE PUNCHLINE
# ================================================================
print(f"{'=' * 70}")
print(f"THE PUNCHLINE")
print(f"{'=' * 70}")
year1_free = 200
year1_pro = 20
year1_total = year1_free + year1_pro
year1_servers = 1
year1_infra = 100
print(f"""
Year 1 reality: {year1_total} venues ({year1_free} free + {year1_pro} Pro)
Infrastructure: {year1_servers} server @ €{year1_infra}/mo
Actual cost per free venue: €{year1_infra / year1_total:.2f}/mo (not €4!)
Actual cost per Pro venue: €{year1_infra / year1_total:.2f}/mo (not €6!)
We budgeted €4/mo per free venue. Reality is €0.45/mo.
That means the free tier is ~9× cheaper than we estimated.
The original financial model is CONSERVATIVE.
With real infrastructure costs:
Year 1 net improvement: +€{(4 - year1_infra/year1_total) * year1_free * 12:.0f}/yr saved on free tier
This doesn't change when you hit salary targets,
but it means your runway is much longer and your
margins are much better than the financial model suggests.
One €100/mo server handles your first ~500 venues.
You won't need a second server until you're already profitable.
""")