From 2bb381a0a3fd9ee1c6688124b948053a69029d77 Mon Sep 17 00:00:00 2001
From: Mikkel Georgsen
Date: Sun, 8 Feb 2026 03:06:53 +0100
Subject: [PATCH] Update tech stack research with finalized decisions

Resolve all open questions from tech stack review:

- Self-hosted on Hetzner PVE (LXC + Docker)
- Event-based sync via NATS JetStream
- Generic display system with Android client (no Cast SDK dep)
- Docker-based RPi5 provisioning
- No money handling, 72h offline limit, REST + OpenAPI
- PVM signup-first for player accounts

Co-Authored-By: Claude Opus 4.6
---
 docs/TECH_STACK_RESEARCH.md | 273 +++++++++++++++++++-----------------
 1 file changed, 148 insertions(+), 125 deletions(-)

diff --git a/docs/TECH_STACK_RESEARCH.md b/docs/TECH_STACK_RESEARCH.md
index 229cafa..0ffaf11 100644
--- a/docs/TECH_STACK_RESEARCH.md
+++ b/docs/TECH_STACK_RESEARCH.md
@@ -579,11 +579,12 @@ On reconnect:
 ### RPi5 System Setup
 
 - **OS**: Raspberry Pi OS Lite (64-bit, Debian Bookworm-based) — no desktop environment
-- **Storage**: 32 GB+ microSD or USB SSD (recommended for durability)
-- **Auto-start**: systemd service for the PVM binary
-- **Updates**: OTA binary updates via a self-update mechanism (download new binary, verify signature, swap, restart)
-- **Watchdog**: Hardware watchdog timer to auto-reboot if the process hangs
-- **Networking**: Ethernet preferred (reliable), WiFi as fallback. mDNS for local discovery.
+- **Runtime**: Docker + Docker Compose. Two containers: `pvm-node` (Rust binary) + `pvm-nats-leaf` (NATS)
+- **Storage**: 32 GB+ microSD or USB SSD (recommended for durability). libSQL database in a Docker volume.
+- **Auto-start**: Docker Compose with `restart: always`. A systemd service ensures Docker itself starts on boot.
+- **Updates**: `docker compose pull && docker compose up -d` — automated via cron or a webhook from the cloud.
+- **Watchdog**: Docker health checks + hardware watchdog timer to auto-reboot if containers fail.
+- **Networking**: Ethernet preferred (reliable), WiFi as fallback. mDNS for local display device discovery. WireGuard tunnel to the Hetzner cloud.
 
 ### Gotchas
 
@@ -595,75 +596,97 @@ On reconnect:
 
 ---
 
-## 11. Chromecast / Display Streaming
+## 11. Venue Display System
 
-### Recommendation: **Google Cast SDK** with a **Custom Web Receiver** (SvelteKit static app)
+### Recommendation: **Generic web display app** + **Android display client** (no Google Cast SDK dependency)
 
 ### Architecture
 
 ```
-┌──────────────┐    Cast SDK     ┌──────────────────┐
-│  Sender App  │ ──────────────► │ Custom Web       │
-│  (PVM Admin  │  (discovers &   │ Receiver         │
-│  Dashboard)  │   launches)     │ (SvelteKit SPA)  │
-│              │                 │                  │
-│      or      │                 │ Hosted at:       │
-│              │                 │ cast.pvmapp.com  │
-│  Local Node  │                 │                  │
-│  HTTP Server │                 │ Connects to WS   │
-│              │                 │ for live updates │
-└──────────────┘                 └────────┬─────────┘
-                                          │
-                                ┌─────────▼──────────┐
-                                │ Chromecast Device  │
-                                │ (renders receiver) │
-                                └────────────────────┘
+┌──────────────────┐
+│  Screen Manager  │  (part of admin dashboard)
+│ - Assign streams │  Venue staff assigns content to each display
+│ - Per-TV config  │
+└────────┬─────────┘
+         │ WebSocket (display assignment)
+         ▼
+┌──────────────────┐        ┌──────────────────┐
+│ Local RPi5 Node  │◄─ mDNS─┤ Display Devices  │
+│ serves display   │  auto  │ (Android box /   │
+│ web app + WS     │  disco │  smart TV /      │
+│                  ├─────────► Chromecast)     │
+└────────┬─────────┘        └──────────────────┘
+         │                           │
+   if offline:                  fallback:
+   serves locally            connect to cloud
+         │                   SaaS URL directly
+         ▼                           │
+┌──────────────────┐        ┌────────▼─────────┐
+│ Display renders  │        │ Display renders  │
+│ from local node  │        │ from cloud       │
+└──────────────────┘        └──────────────────┘
 ```
 
-### Custom Web Receiver
+### Display Client (Android App)
 
-The Cast receiver is a **separate SvelteKit static app** that:
+A lightweight Android app (running on a smart TV or a $40 4K Android box) that:
 
-1. Loads on the Chromecast device when cast is initiated
-2. Connects to the PVM WebSocket endpoint (cloud or local node, depending on network)
-3. Subscribes to venue-specific events (tournament clock, waitlist, seat map)
-4. Renders full-screen display layouts:
-   - **Tournament clock**: Large timer, current level, blind structure, next break
-   - **Waiting list**: Player queue by game type, estimated wait times
-   - **Table status**: Open seats, game types, stakes per table
-   - **Custom messages**: Announcements, promotions
+1. **Auto-starts on boot** — kiosk mode, no user interaction needed
+2. **Discovers the local node via mDNS** — zero-config for venue staff, with manual IP entry as a fallback
+3. **Registers with a unique device ID** — appears automatically in the Screen Manager dashboard
+4. **Receives its display assignment via WebSocket** — the system tells it what to render
+5. **Renders a full-screen web page** — the display content is a standard SvelteKit static page
+6. **Falls back to the cloud SaaS** if the local RPi5 node is offline
+7. **Remotely controllable** — venue staff can change the stream, restart the app, or push an announcement overlay from the Screen Manager
 
-### Display Manager
+### Display Content (SvelteKit Static App)
 
-A venue can have **multiple Chromecast devices** showing different content:
+The display views are a **separate SvelteKit static build** optimized for large screens:
 
-- TV 1: Tournament clock (main)
-- TV 2: Cash game waiting list
-- TV 3: Table/seat map
-- TV 4: Rotating between tournament clock and waiting list
+- **Tournament clock**: Large timer, current level, blind structure, next break, average stack
+- **Waiting list**: Player queue by game type, estimated wait times
+- **Table status**: Open seats, game types, stakes per table
+- **Seatings**: Tournament seat assignments after draws
+- **Custom slideshow**: Announcements, promotions, venue info (managed by staff)
+- **Rotation mode**: Cycle between multiple views on a configurable timer
 
-The **Display Manager** (part of the admin dashboard) lets floor managers:
-- Assign content to each Chromecast device
-- Configure rotation/cycling between views
-- Send one-time announcements to all screens
+### Screen Manager
+
+The **Screen Manager** (part of the admin dashboard) lets floor managers:
+
+- See all connected display devices with their status (online, offline, current content)
+- Assign content streams to each device (TV 1-5: tournament clock, TV 6: waitlist, etc.)
+- Configure rotation/cycling between views per device
+- Send one-time announcements to all screens or specific screens
 - Adjust display themes (dark/light, font size, venue branding)
+- Group screens (e.g. "Tournament Area", "Cash Room", "Lobby")
 
 ### Technical Details
 
-- Register the receiver app with Google Cast Developer Console (one-time setup, $5 fee)
-- Use Cast Application Framework (CAF) Receiver SDK v3
-- The receiver app is a standard web page — can use any web framework (SvelteKit static build)
-- Sender integration: use the `cast.framework.CastContext` API in the admin dashboard
-- For **local network casting** (offline mode): the local node serves the receiver app directly, and the Chromecast connects to the local node's IP
-- Consider also supporting **generic HDMI displays** via a simple browser in kiosk mode (Chromium on a secondary RPi or mini PC) as a non-Chromecast fallback
+- The display web app is served by the local node's HTTP server (Axum) for the lowest latency
+- WebSocket connection for live data updates (tournament clock ticks, waitlist changes)
+- Each display device is identified by a stable device ID (generated on first boot, persisted)
+- mDNS service type `_pvm-display._tcp.local` for auto-discovery
+- Display URLs: `http://{local-node-ip}/display/{device-id}` (local) or `https://app.pvmapp.com/display/{device-id}` (cloud fallback)
+- Dark mode by default (poker venues are low-light environments)
+- Large fonts, high contrast — designed for viewing from across the room
+
+### Chromecast Compatibility
+
+Chromecast is supported as a **display target** but is not the primary architecture:
+
+- Smart TVs with built-in Chromecast or attached Chromecast dongles can open the display URL
+- No Google Cast SDK dependency — just opening a URL
+- The Android display client app is the recommended approach for reliability and offline support
 
 ### Gotchas
 
-- Chromecast devices have limited memory and CPU — keep the receiver app lightweight (Svelte is ideal here)
-- Cast sessions can timeout after inactivity — implement keep-alive messages
-- Chromecast requires an internet connection for initial app load (it fetches the receiver URL from Google's servers) — for fully offline venues, the kiosk-mode browser fallback is essential
-- Test on actual Chromecast hardware early — the developer emulator doesn't catch all issues
-- Cast SDK requires HTTPS for the receiver URL in production (self-signed certs won't work on Chromecast)
+- Android kiosk mode needs careful implementation — prevent users from exiting the app, and handle OS updates gracefully
+- mDNS can be unreliable on some enterprise/venue networks — always offer a manual IP fallback
+- Display devices on venue WiFi may have intermittent connectivity — design for reconnection and state catch-up
+- Keep the display app extremely lightweight — some $40 Android boxes have limited RAM
+- Test on actual cheap Android hardware early — performance varies wildly
+- Power cycling (venues close nightly) must be handled gracefully — auto-start, auto-reconnect, auto-resume
 
 ---
 
@@ -710,71 +733,76 @@ PVM's mobile needs are primarily **consumption-oriented** — players check tour
 
 ## 13. Deployment & Infrastructure
 
-### Recommendation: **Fly.io** (primary cloud) + **Docker** containers + **GitHub Actions** CI/CD
-
-### Alternatives Considered
-
-| Platform | Pros | Cons |
-|----------|------|------|
-| **Fly.io** | Edge deployment, built-in Postgres, simple scaling, good pricing, Rust-friendly | CLI-first workflow, no built-in CI/CD |
-| **Railway** | Excellent DX, GitHub integration, preview environments | Less edge presence, newer |
-| **AWS (ECS/Fargate)** | Full control, enterprise grade, broadest service catalog | Complex, expensive operations overhead |
-| **Render** | Simple, good free tier | Less flexible networking, no edge |
-| **Hetzner + manual** | Cheapest, full control | Operations burden, no managed services |
+### Recommendation: **Self-hosted on Hetzner PVE** (LXC containers) + **Docker** + **Forgejo Actions** CI/CD
 
 ### Reasoning
 
-**Fly.io** is the best fit for PVM:
+The project already has a Hetzner Proxmox VE (PVE) server. Running PVM in LXC containers on the existing infrastructure keeps costs minimal and gives full control.
 
-1. **Edge deployment**: Fly.io runs containers close to users. For a poker venue SaaS with venues in multiple cities/countries, edge deployment means lower latency for real-time tournament updates.
-2. **Built-in Postgres**: Fly Postgres is managed, with automatic failover and point-in-time recovery.
-3. **Fly Machines**: Fine-grained control over machine placement — can run NATS, DragonflyDB, and the API server as separate Fly machines.
-4. **Rust-friendly**: Fly.io's multi-stage Docker builds work well for Rust (build on large machine, deploy tiny binary).
-5. **Private networking**: Fly's WireGuard mesh enables secure communication between services without exposing ports publicly. The RPi5 local nodes can use Fly's WireGuard to connect to the cloud NATS cluster.
-6. **Reasonable pricing**: Pay-as-you-go, no minimum commitment. Scale to zero for staging environments.
+1. **LXC containers on PVE**: Lightweight, near-native performance, easy to snapshot and back up. Each environment gets its own LXC container, with Docker running the individual services inside it.
+2. **Docker Compose for services**: All cloud services are defined in a single `docker-compose.yml` — simple to start, stop, and update.
+3. **No vendor lock-in**: Everything runs on standard Linux + Docker, so PVM can migrate to any cloud or other bare metal trivially.
+4. **WireGuard for RPi5 connectivity**: RPi5 local nodes connect to the Hetzner server via a WireGuard tunnel for secure NATS leaf-node communication.
+5. **Forgejo Actions**: CI/CD runs on the same Forgejo instance that hosts the code.
 
 ### Infrastructure Layout
 
 ```
-Fly.io Cloud
-├── pvm-api (Axum, 2+ instances, auto-scaled)
-├── pvm-ws-gateway (Axum WebSocket, 2+ instances)
-├── pvm-nats (NATS cluster, 3 nodes)
-├── pvm-db (Fly Postgres, primary + replica)
-├── pvm-cache (DragonflyDB, single node)
-└── pvm-worker (background jobs: sync processing, notifications)
+Hetzner PVE Server
+├── LXC: pvm-cloud
+│   ├── Docker: pvm-api (Axum)
+│   ├── Docker: pvm-ws-gateway (Axum WebSocket)
+│   ├── Docker: pvm-worker (background jobs: sync, notifications)
+│   ├── Docker: pvm-nats (NATS server)
+│   ├── Docker: pvm-db (PostgreSQL 16)
+│   └── Docker: pvm-cache (DragonflyDB)
+├── LXC: pvm-staging (mirrors production for testing)
+└── WireGuard endpoint for RPi5 nodes
 
-Venue (RPi5)
-└── pvm-node (single Rust binary + NATS leaf node)
-    └── connects to pvm-nats via WireGuard/TLS
+Venue (RPi5 — Docker on Raspberry Pi OS)
+├── Docker: pvm-node (Rust binary — API proxy + sync engine)
+├── Docker: pvm-nats-leaf (NATS leaf node)
+└── connects to Hetzner via WireGuard/TLS
 ```
 
-### CI/CD Pipeline (GitHub Actions)
+### RPi5 Local Node (Docker-based)
+
+The local node runs **Docker on stock Raspberry Pi OS (64-bit)**:
+
+- **Provisioning**: A one-liner curl script installs Docker and pulls the PVM stack (`docker compose pull && docker compose up -d`)
+- **Updates**: Pull new images and restart (`docker compose pull && docker compose up -d`), automated via a cron job or a self-update webhook
+- **Rollback**: Previous images remain on disk. Roll back by pinning the previous image tag and running `docker compose up -d --force-recreate`.
+- **Services**: `pvm-node` (Rust binary) + `pvm-nats-leaf` (NATS leaf node). Two containers, minimal footprint.
+- **Storage**: libSQL database stored in a Docker volume on the SD card (or a USB SSD for heavy-write venues)
+
+### CI/CD Pipeline (Forgejo Actions)
 
 ```yaml
 # Triggered on push to main
-1. Lint (clippy, eslint)
-2. Test (cargo test, vitest, playwright)
-3. Build (multi-stage Docker for cloud, cross-compile for RPi5)
-4. Deploy staging (auto-deploy to Fly.io staging)
-5. E2E tests against staging
-6. Deploy production (manual approval gate)
-7. Publish RPi5 binary (signed, to update server)
+1. Lint (clippy, biome)
+2. Test (cargo nextest, vitest, playwright)
+3. Build (multi-stage Docker for cloud + cross-compiled ARM64 for RPi5)
+4. Push images to the container registry
+5. Deploy staging (docker compose pull on the staging LXC)
+6. E2E tests against staging
+7. Deploy production (manual approval, docker compose on the production LXC)
8. Publish RPi5 images (ARM64 Docker images to the registry)
 ```
 
 ### Gotchas
 
-- Fly.io Postgres is not fully managed — you still need to handle major version upgrades and backup verification
-- Use multi-stage Docker builds to keep Rust image sizes small (builder stage with `rust:bookworm`, runtime stage with `debian:bookworm-slim` or `distroless`)
-- Pin Fly.io machine regions to match your target markets — don't spread too thin initially
-- Set up blue-green deployments for zero-downtime upgrades
-- The RPi5 binary update mechanism needs a rollback strategy — keep the previous binary and a fallback boot option
+- Use multi-stage Docker builds for Rust: a builder stage with `rust:bookworm` and a runtime stage with `debian:bookworm-slim` or `distroless`
+- PostgreSQL backups: automate `pg_dump` to a separate backup location (another Hetzner storage box or off-site)
+- Set up blue-green deployments via Docker Compose profiles for zero-downtime upgrades
+- Monitor Hetzner server resources — if PVM outgrows a single server, split services across multiple LXCs or servers
+- WireGuard keys for RPi5 nodes: automate key generation and registration during provisioning
+- The RPi5 Docker update mechanism needs a health check — if new images fail, auto-roll back to the previous tag
 
 ---
 
 ## 14. Monitoring & Observability
 
-### Recommendation: **OpenTelemetry** (traces + metrics + logs) exported to **Grafana Cloud** (or self-hosted Grafana + Loki + Tempo + Prometheus)
+### Recommendation: **OpenTelemetry** (traces + metrics + logs) exported to **self-hosted Grafana + Loki + Tempo + Prometheus** (on Hetzner PVE)
 
 ### Alternatives Considered
@@ -1142,9 +1170,9 @@ Venues should be able to customize their displays:
 | **Auth** | Custom JWT + RBAC | Offline-capable, cross-venue, full control |
 | **API Design** | REST + OpenAPI 3.1 | Generated TypeScript client, universal compatibility |
 | **Mobile** | PWA first, Capacitor later | One codebase, offline support, app store when needed |
-| **Cast/Display** | Google Cast SDK + Custom Web Receiver | SvelteKit static app on Chromecast |
-| **Deployment** | Fly.io + Docker | Edge deployment, managed Postgres, WireGuard |
-| **CI/CD** | GitHub Actions + Turborepo | Cross-language build orchestration, caching |
+| **Displays** | Generic web app + Android display client | No Cast SDK dependency, works offline, mDNS auto-discovery |
+| **Deployment** | Hetzner PVE + Docker (LXC containers) | Self-hosted, full control, existing infrastructure |
+| **CI/CD** | Forgejo Actions + Turborepo | Cross-language build orchestration, caching |
 | **Monitoring** | OpenTelemetry + Grafana | Vendor-neutral, excellent Rust support |
 | **Testing** | cargo-nextest + Vitest + Playwright | Full pyramid: unit, integration, E2E |
 | **Styling** | Tailwind CSS v4 | Fast, small bundles, Svelte-native |
@@ -1153,38 +1181,33 @@
 
 ---
 
-## Open Questions / Decisions Needed
+## Decisions Made
 
-### High Priority
+> Resolved during the tech stack review session, 2026-02-08.
 
-1. **Fly.io vs. self-hosted**: Fly.io simplifies operations but creates vendor dependency. For a bootstrapped SaaS, the convenience is worth it. For VC-funded with an ops team, self-hosted on Hetzner could be cheaper at scale. **Decision: Start with Fly.io, design for portability.**
+| # | Question | Decision |
+|---|----------|----------|
+| 1 | **Hosting** | **Self-hosted on Hetzner PVE** — LXC containers. Already have the infrastructure. No Fly.io dependency. |
+| 2 | **Sync strategy** | **Event-based sync via NATS JetStream** — all mutations are events; the local node replays them to build state. Perfect audit trail. No table-vs-row debate. |
+| 3 | **NATS on RPi5** | **Sidecar** — a separate process managed by systemd/Docker. Independently upgradeable and monitorable. |
+| 4 | **Financial data** | **No money handling at all.** Venues handle payments via their own POS systems (most are cash-based). PVM only tracks game data. |
+| 5 | **Multi-region** | **Single region initially.** Design the DB schema and NATS subjects for eventual multi-region without a rewrite. |
+| 6 | **Player accounts** | **PVM signup first.** Players always create a PVM account before joining venues. No deduplication problem. |
+| 7 | **Display strategy** | **Generic web app + Android display client.** TVs run a simple Android app (or a $40 Android box) that connects to the local node via mDNS auto-discovery, receives its display assignment via WebSocket, and renders a web page. Falls back to the cloud SaaS if the local node is offline. Chromecast is supported but not the primary path. No Google Cast SDK dependency. |
+| 8 | **RPi5 provisioning** | **Docker on stock Raspberry Pi OS.** All PVM services (node, NATS) run as containers. Updates via image pulls. Provisioning is a one-liner curl script. |
+| 9 | **Offline duration** | **72 hours.** Covers a full weekend tournament series. After 72h offline, warn staff but keep operating. Sync everything on reconnect. |
+| 10 | **API style** | **REST + OpenAPI 3.1.** Auto-generated TypeScript client. Universal, debuggable, works with everything. |
 
-2. **libSQL sync granularity**: Should the local node sync entire tables or individual rows? Row-level sync is more efficient but more complex to implement. **Recommendation: Start with table-level sync for the initial version, refine to row-level as data volumes grow.**
+## Deferred Questions
 
-3. **NATS embedded vs. sidecar on RPi5**: Running NATS as an embedded library (via `nats-server` Rust bindings) vs. a separate process. Embedded is simpler but couples versions tightly. **Recommendation: Sidecar (separate process managed by systemd) for operational flexibility.**
+These remain open for future consideration:
 
-4. **Financial data handling**: Does PVM handle actual money transactions, or only track buy-ins/credits as records? If handling real money, PCI DSS and financial regulations apply. **Recommendation: Track records only. Integrate with Stripe for actual payments.**
+1. **API versioning strategy**: Maintain backward compatibility as long as possible. Only version on breaking changes. Revisit when approaching the first external API consumers.
 
-5. **Multi-region from day one?**: Should the initial architecture support venues in multiple countries/regions? This affects Postgres replication strategy and NATS cluster topology. **Recommendation: Single region initially, design NATS subjects and DB schema for eventual multi-region.**
+2. **GraphQL for player-facing app**: REST is sufficient for v1. The player app might benefit from GraphQL's flexible querying later (e.g., "show me my upcoming tournaments across all venues with waitlist status"). **Revisit after v1 launch.**
 
-### Medium Priority
+3. **WebTransport**: When browser support matures, it could replace WebSockets for lower-latency real-time streams. **Monitor but do not adopt yet.**
 
-6. **Player account deduplication**: When a player signs up at two venues independently, how do we detect and merge accounts? Email match? Phone match? Manual linking? **Needs product decision.**
+4. **WASM on local node**: Could parts of the frontend run on the local node via WASM for ultra-fast local rendering? **Defer.**
 
-7. **Chromecast vs. generic display hardware**: Should the primary display strategy be Chromecast, or should we target a browser-in-kiosk-mode approach that also works with Chromecast? **Recommendation: Build the receiver as a standard web app first (works in kiosk mode), add Cast SDK integration second.**
-
-8. **RPi5 provisioning**: How are local nodes set up? Manual image flashing? Automated provisioning? Remote setup? **Recommendation: Pre-built OS image with first-boot wizard that connects to cloud and provisions the node.**
-
-9. **Offline duration limits**: How long should a local node operate offline before we consider the data stale? 1 hour? 1 day? 1 week? **Needs product decision based on venue feedback.**
-
-10. **API versioning strategy**: When do we introduce `/api/v2/`? Should we support multiple versions simultaneously? **Recommendation: Semantic versioning for the API spec. Maintain backward compatibility as long as possible. Only version on breaking changes.**
-
-### Low Priority
-
-11. **GraphQL for player-facing app**: The admin dashboard is well-served by REST, but the player app might benefit from GraphQL's flexible querying (e.g., "show me my upcoming tournaments across all venues with waitlist status"). **Revisit after v1 launch.**
-
-12. **WebTransport**: When browser support matures and Chromecast supports it, WebTransport could replace WebSockets for lower-latency, multiplexed real-time streams. **Monitor but do not adopt yet.**
-
-13. **WASM on local node**: Could parts of the frontend run on the local node via WASM for ultra-fast local rendering? Interesting but not a priority. **Defer.**
-
-14. **AI features**: Player behavior analytics, optimal table assignments, tournament structure recommendations. The data model should be designed to support future ML pipelines. **Design for it, build later.**
+5. **AI features**: Player behavior analytics, optimal table assignments, tournament structure recommendations. The data model should be designed to support future ML pipelines. **Design for it, build later.**
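---

The two-container RPi5 stack this patch describes (decision #8, section 13) could be sketched as a minimal Compose file. This is an illustrative sketch only: the registry, image names, volume path, config path, and the health-check subcommand are assumptions, not values taken from the project.

```yaml
# Hypothetical sketch of the RPi5 stack (image names, paths, and
# the health-check command are assumed, not from the project).
services:
  pvm-node:
    image: registry.example.com/pvm/pvm-node:1.0.0   # pinned tag, so rollback is re-pinning the previous tag
    restart: always
    depends_on:
      - pvm-nats-leaf
    volumes:
      - pvm-data:/var/lib/pvm                        # libSQL database lives in a named volume
    healthcheck:
      test: ["CMD", "/usr/local/bin/pvm-node", "health"]   # assumed subcommand
      interval: 30s
      retries: 3

  pvm-nats-leaf:
    image: nats:2.10
    restart: always
    command: ["-c", "/etc/nats/leaf.conf"]           # leaf-node config pointing at the cloud NATS over WireGuard
    volumes:
      - ./leaf.conf:/etc/nats/leaf.conf:ro

volumes:
  pvm-data:
```

Under this layout, the update flow described above reduces to `docker compose pull && docker compose up -d`, and rollback to re-pinning the previous image tag and recreating the containers.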