Update tech stack research with finalized decisions

Resolve all open questions from tech stack review:
- Self-hosted on Hetzner PVE (LXC + Docker)
- Event-based sync via NATS JetStream
- Generic display system with Android client (no Cast SDK dep)
- Docker-based RPi5 provisioning
- No money handling, 72h offline limit, REST + OpenAPI
- PVM signup-first for player accounts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mikkel Georgsen, 2026-02-08 03:06:53 +01:00
parent cf03b3592a · commit 2bb381a0a3


@@ -579,11 +579,12 @@ On reconnect:
### RPi5 System Setup
- **OS**: Raspberry Pi OS Lite (64-bit, Debian Bookworm-based) — no desktop environment
- **Runtime**: Docker + Docker Compose. Two containers: `pvm-node` (Rust binary) + `pvm-nats-leaf` (NATS)
- **Storage**: 32 GB+ microSD or USB SSD (recommended for durability). libSQL database in a Docker volume.
- **Auto-start**: Docker Compose with `restart: always`. systemd service ensures Docker starts on boot.
- **Updates**: `docker compose pull && docker compose up -d` — automated via cron or webhook from cloud.
- **Watchdog**: Docker health checks + hardware watchdog timer to auto-reboot if containers fail.
- **Networking**: Ethernet preferred (reliable), WiFi as fallback. mDNS for local display device discovery. WireGuard tunnel to Hetzner cloud.
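The two-container layout above could look roughly like the following `docker-compose.yml`. This is a sketch, not the project's actual file: the image names, the `pvm-node healthcheck` subcommand, and the volume path are all illustrative.

```yaml
# Hypothetical compose file for the RPi5 local node.
services:
  pvm-node:
    image: registry.example.com/pvm/pvm-node:stable   # pin a tag in practice
    restart: always                                   # survives nightly power cycles
    depends_on: [pvm-nats-leaf]
    volumes:
      - pvm-data:/var/lib/pvm            # libSQL database lives in this volume
    healthcheck:
      test: ["CMD", "pvm-node", "healthcheck"]        # assumed subcommand
      interval: 30s
      retries: 3

  pvm-nats-leaf:
    image: nats:2
    restart: always
    command: ["-c", "/etc/nats/leaf.conf"]            # leaf-node config mounted below
    volumes:
      - ./leaf.conf:/etc/nats/leaf.conf:ro

volumes:
  pvm-data:
```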
### Gotchas
@@ -595,75 +596,97 @@ On reconnect:
---
## 11. Venue Display System
### Recommendation: **Generic web display app** + **Android display client** (no Google Cast SDK dependency)
### Architecture
```
┌──────────────────┐
│ Screen Manager   │  (part of admin dashboard)
│ - Assign streams │  Venue staff assigns content to each display
│ - Per-TV config  │
└────────┬─────────┘
         │ WebSocket (display assignment)
         │
┌────────▼─────────┐           ┌──────────────────┐
│ Local RPi5 Node  │◄─ mDNS ───┤ Display Devices  │
│ serves display   │   auto    │ (Android box /   │
│ web app + WS     │   disco   │  smart TV /      │
│                  ├──────────►│  Chromecast)     │
└────────┬─────────┘           └────────┬─────────┘
         │                              │
   if offline:                     fallback:
   serves locally                  connect to cloud
         │                         SaaS URL directly
         ▼                              │
┌──────────────────┐           ┌────────▼─────────┐
│ Display renders  │           │ Display renders  │
│ from local node  │           │ from cloud       │
└──────────────────┘           └──────────────────┘
```
### Display Client (Android App)
A lightweight Android app (or a $40 4K Android box) that:
1. **Auto-starts on boot** — kiosk mode, no user interaction needed
2. **Discovers the local node via mDNS** — zero-config for venue staff, falls back to manual IP entry
3. **Registers with a unique device ID** — appears automatically in the Screen Manager dashboard
4. **Receives display assignment via WebSocket** — the system tells it what to render
5. **Renders a full-screen web page** — the display content is a standard SvelteKit static page
6. **Falls back to cloud SaaS** if the local RPi5 node is offline
7. **Remotely controllable** — venue staff can change the stream, restart, or push an announcement overlay from the Screen Manager
### Display Content (SvelteKit Static App)
The display views are a **separate SvelteKit static build** optimized for large screens:
- **Tournament clock**: Large timer, current level, blind structure, next break, average stack
- **Waiting list**: Player queue by game type, estimated wait times
- **Table status**: Open seats, game types, stakes per table
- **Seatings**: Tournament seat assignments after draws
- **Custom slideshow**: Announcements, promotions, venue info (managed by staff)
- **Rotation mode**: Cycle between multiple views on a configurable timer
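Rotation mode reduces to a modular index over the configured views: divide elapsed time by the rotation interval and wrap around the list. A minimal sketch — the `Stream` variants and parameters are invented for illustration:

```rust
/// Content streams a screen can show (hypothetical subset).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Stream {
    TournamentClock,
    Waitlist,
    TableStatus,
    Slideshow,
}

/// Pick the view to show `seconds_elapsed` into a rotation.
/// `interval` is the per-view dwell time in seconds (must be > 0).
fn current_view(rotation: &[Stream], seconds_elapsed: u64, interval: u64) -> Stream {
    let idx = (seconds_elapsed / interval) as usize % rotation.len();
    rotation[idx]
}

fn main() {
    let rotation = [Stream::TournamentClock, Stream::Waitlist];
    // 65 s into a 30 s rotation: third slot, wrapped back to the first view.
    println!("{:?}", current_view(&rotation, 65, 30)); // TournamentClock
}
```

Because the function is pure, every display device computes the same view from a shared clock, keeping screens in sync without extra coordination messages.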
### Screen Manager
The **Screen Manager** (part of the admin dashboard) lets floor managers:
- See all connected display devices with status (online, offline, content)
- Assign content streams to each device (TV 1-5: tournament clock, TV 6: waitlist, etc.)
- Configure rotation/cycling between views per device
- Send one-time announcements to all screens or specific screens
- Adjust display themes (dark/light, font size, venue branding)
- Group screens (e.g. "Tournament Area", "Cash Room", "Lobby")
### Technical Details
- Display web app is served by the local node's HTTP server (Axum) for lowest latency
- WebSocket connection for live data updates (tournament clock ticks, waitlist changes)
- Each display device is identified by a stable device ID (generated on first boot, persisted)
- mDNS service type: `_pvm-display._tcp.local` for auto-discovery
- Display URLs: `http://{local-node-ip}/display/{device-id}` (local) or `https://app.pvmapp.com/display/{device-id}` (cloud fallback)
- Dark mode by default (poker venues are low-light environments)
- Large fonts, high contrast — designed for viewing from across the room
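The local-vs-cloud URL fallback is a small pure decision: use the mDNS-discovered (or manually entered) local node address when one is known, otherwise the cloud SaaS. A sketch following the URL shapes listed above (function name is illustrative; the real logic lives in the Android client):

```rust
/// Resolve the URL a display device should load.
/// `local_node_ip` is Some(...) when mDNS discovery (or manual entry) found
/// the RPi5 node; None means fall back to the cloud SaaS.
fn display_url(local_node_ip: Option<&str>, device_id: &str) -> String {
    match local_node_ip {
        // Local node reachable: lowest latency, works fully offline.
        Some(ip) => format!("http://{ip}/display/{device_id}"),
        // No local node: cloud fallback.
        None => format!("https://app.pvmapp.com/display/{device_id}"),
    }
}

fn main() {
    println!("{}", display_url(Some("192.168.1.50"), "tv-3"));
    // http://192.168.1.50/display/tv-3
    println!("{}", display_url(None, "tv-3"));
    // https://app.pvmapp.com/display/tv-3
}
```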
### Chromecast Compatibility
Chromecast is supported as a **display target** but not the primary architecture:
- Smart TVs with built-in Chromecast or attached Chromecast dongles can open the display URL
- No Google Cast SDK dependency — just opening a URL
- The Android display client app is the recommended approach for reliability and offline support
### Gotchas
- Android kiosk mode needs careful implementation — prevent users from exiting the app, handle OS updates gracefully
- mDNS can be unreliable on some enterprise/venue networks — always offer manual IP fallback
- Display devices on venue WiFi may have intermittent connectivity — design for reconnection and state catch-up
- Keep the display app extremely lightweight — some $40 Android boxes have limited RAM
- Test on actual cheap Android hardware early — performance varies wildly
- Power cycling (venue closes nightly) must be handled gracefully — auto-start, auto-reconnect, auto-resume
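The reconnection gotchas above usually come down to a capped exponential backoff on the client, so a flapping venue WiFi doesn't hammer the node. A sketch with illustrative parameters (1 s base doubling up to a 60 s cap):

```rust
use std::time::Duration;

/// Delay before reconnect attempt number `attempt` (0-based).
/// Grows 1 s, 2 s, 4 s, ... and is capped at 60 s — parameters are
/// illustrative, not the project's actual tuning.
fn reconnect_delay(attempt: u32) -> Duration {
    let base = 2u64; // doubling factor
    let secs = base.saturating_pow(attempt.min(6)); // clamp exponent to avoid overflow
    Duration::from_secs(secs.min(60)) // never wait longer than a minute
}

fn main() {
    for attempt in [0u32, 1, 3, 10] {
        println!("attempt {attempt}: wait {:?}", reconnect_delay(attempt));
    }
}
```

On reconnect the client should also request a state catch-up (current clock level, waitlist snapshot) rather than assuming nothing changed while it was away.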
---
@@ -710,71 +733,76 @@ PVM's mobile needs are primarily **consumption-oriented** — players check tournaments, waitlists, results.
## 13. Deployment & Infrastructure
### Recommendation: **Self-hosted on Hetzner PVE** (LXC containers) + **Docker** + **Forgejo Actions** CI/CD
### Reasoning
The project already has a Hetzner Proxmox VE (PVE) server. Running PVM in LXC containers on the existing infrastructure keeps costs minimal and gives full control.
1. **LXC containers on PVE**: Lightweight, near-native performance, easy to snapshot and backup. Each service gets its own container or Docker runs inside an LXC.
2. **Docker Compose for services**: All cloud services defined in a single `docker-compose.yml`. Simple to start, stop, and update.
3. **No vendor lock-in**: Everything runs on standard Linux + Docker. Can migrate to any cloud or other bare metal trivially.
4. **WireGuard for RPi5 connectivity**: RPi5 local nodes connect to the Hetzner server via WireGuard tunnel for secure NATS leaf node communication.
5. **Forgejo Actions**: CI/CD runs on the same Forgejo instance hosting the code.
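A venue node's WireGuard config might look like the following `wg0.conf` sketch — the keys, tunnel addresses, and endpoint hostname are placeholders to be filled during provisioning:

```ini
# Sketch of a venue node's /etc/wireguard/wg0.conf — all values are placeholders.
[Interface]
PrivateKey = <generated-during-provisioning>
Address = 10.88.0.12/32            # one tunnel address per venue node

[Peer]                             # the Hetzner PVE endpoint
PublicKey = <hetzner-endpoint-public-key>
Endpoint = vpn.pvmapp.com:51820    # hypothetical hostname
AllowedIPs = 10.88.0.1/32          # route only PVM cloud traffic through the tunnel
PersistentKeepalive = 25           # keep NAT mappings alive behind venue routers
```

Narrow `AllowedIPs` keeps the venue's general internet traffic off the tunnel; only NATS leaf-node and API traffic to the cloud goes through WireGuard.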
### Infrastructure Layout
```
Hetzner PVE Server
├── LXC: pvm-cloud
│   ├── Docker: pvm-api (Axum)
│   ├── Docker: pvm-ws-gateway (Axum WebSocket)
│   ├── Docker: pvm-worker (background jobs: sync, notifications)
│   ├── Docker: pvm-nats (NATS cluster)
│   ├── Docker: pvm-db (PostgreSQL 16)
│   └── Docker: pvm-cache (DragonflyDB)
├── LXC: pvm-staging (mirrors production for testing)
└── WireGuard endpoint for RPi5 nodes

Venue (RPi5 — Docker on Raspberry Pi OS)
├── Docker: pvm-node (Rust binary — API proxy + sync engine)
├── Docker: pvm-nats-leaf (NATS leaf node)
└── connects to Hetzner via WireGuard/TLS
```
### RPi5 Local Node (Docker-based)
The local node runs **Docker on stock Raspberry Pi OS (64-bit)**:
- **Provisioning**: One-liner curl script installs Docker and pulls the PVM stack (`docker compose pull && docker compose up -d`)
- **Updates**: Pull new images and restart (`docker compose pull && docker compose up -d`). Automated via a cron job or self-update webhook.
- **Rollback**: Previous images remain on disk. Roll back with `docker compose up -d --force-recreate` using pinned image tags.
- **Services**: `pvm-node` (Rust binary) + `pvm-nats-leaf` (NATS leaf node). Two containers, minimal footprint.
- **Storage**: libSQL database stored in a Docker volume on the SD card (or USB SSD for heavy-write venues).
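The automated update path can be as simple as a cron entry on the node pulling fresh images nightly — the path and schedule here are hypothetical:

```
# /etc/cron.d/pvm-update — check for new images nightly at 04:30 (path is illustrative)
30 4 * * * root cd /opt/pvm && docker compose pull --quiet && docker compose up -d
```

A webhook-triggered variant would run the same two commands on demand from the cloud instead of on a timer.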
### CI/CD Pipeline (Forgejo Actions)
```yaml
# Triggered on push to main
1. Lint (clippy, biome)
2. Test (cargo nextest, vitest, playwright)
3. Build (multi-stage Docker for cloud + cross-compile ARM64 for RPi5)
4. Push images to container registry
5. Deploy staging (docker compose pull on staging LXC)
6. E2E tests against staging
7. Deploy production (manual approval, docker compose on production LXC)
8. Publish RPi5 images (ARM64 Docker images to registry)
```
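The first few pipeline stages could be expressed as a Forgejo Actions workflow roughly like this. Forgejo Actions is largely GitHub Actions-compatible, but the runner label, registry URL, and exact commands below are assumptions, not the project's real workflow:

```yaml
# .forgejo/workflows/ci.yml — hypothetical sketch of stages 1-4
on:
  push:
    branches: [main]
jobs:
  ci:
    runs-on: docker          # runner label depends on your Forgejo runner setup
    steps:
      - uses: actions/checkout@v4
      - name: Lint
        run: cargo clippy --all-targets -- -D warnings && npx biome ci .
      - name: Test
        run: cargo nextest run && npx vitest run
      - name: Build and push images (amd64 + arm64)
        run: |
          docker buildx build --platform linux/amd64,linux/arm64 \
            -t registry.example.com/pvm/pvm-api:${{ github.sha }} --push .
```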
### Gotchas
- Use multi-stage Docker builds for Rust: builder stage with `rust:bookworm`, runtime stage with `debian:bookworm-slim` or `distroless`
- PostgreSQL backups: automate `pg_dump` to a separate backup location (another Hetzner storage box or off-site)
- Set up blue-green deployments via Docker Compose profiles for zero-downtime upgrades
- Monitor Hetzner server resources — if PVM outgrows a single server, split services across multiple LXCs or servers
- WireGuard keys for RPi5 nodes: automate key generation and registration during provisioning
- The RPi5 Docker update mechanism needs a health check — if new images fail, auto-rollback to previous tag
---
## 14. Monitoring & Observability
### Recommendation: **OpenTelemetry** (traces + metrics + logs) exported to **self-hosted Grafana + Loki + Tempo + Prometheus** (on Hetzner PVE)
### Alternatives Considered
@@ -1142,9 +1170,9 @@ Venues should be able to customize their displays:
| **Auth** | Custom JWT + RBAC | Offline-capable, cross-venue, full control |
| **API Design** | REST + OpenAPI 3.1 | Generated TypeScript client, universal compatibility |
| **Mobile** | PWA first, Capacitor later | One codebase, offline support, app store when needed |
| **Displays** | Generic web app + Android display client | No Cast SDK dependency, works offline, mDNS auto-discovery |
| **Deployment** | Hetzner PVE + Docker (LXC containers) | Self-hosted, full control, existing infrastructure |
| **CI/CD** | Forgejo Actions + Turborepo | Cross-language build orchestration, caching |
| **Monitoring** | OpenTelemetry + Grafana | Vendor-neutral, excellent Rust support |
| **Testing** | cargo-nextest + Vitest + Playwright | Full pyramid: unit, integration, E2E |
| **Styling** | Tailwind CSS v4 | Fast, small bundles, Svelte-native |
@@ -1153,38 +1181,33 @@ Venues should be able to customize their displays:
---
## Decisions Made
> Resolved during tech stack review session, 2026-02-08.

| # | Question | Decision |
|---|----------|----------|
| 1 | **Hosting** | **Self-hosted on Hetzner PVE** — LXC containers. Already have infrastructure. No Fly.io dependency. |
| 2 | **Sync strategy** | **Event-based sync via NATS JetStream** — all mutations are events, local node replays events to build state. Perfect audit trail. No table-vs-row debate. |
| 3 | **NATS on RPi5** | **Sidecar** — separate process managed by systemd/Docker. Independently upgradeable and monitorable. |
| 4 | **Financial data** | **No money handling at all.** Venues handle payments via their own POS systems (most are cash-based). PVM only tracks game data. |
| 5 | **Multi-region** | **Single region initially.** Design DB schema and NATS subjects for eventual multi-region without rewrite. |
| 6 | **Player accounts** | **PVM signup first.** Players always create a PVM account before joining venues. No deduplication problem. |
| 7 | **Display strategy** | **Generic web app + Android display client.** TVs run a simple Android app (or $40 Android box) that connects to the local node via mDNS auto-discovery, receives its display assignment via WebSocket, and renders a web page. Falls back to cloud SaaS if local node is offline. Chromecast is supported but not the primary path. No Google Cast SDK dependency. |
| 8 | **RPi5 provisioning** | **Docker on stock Raspberry Pi OS.** All PVM services (node, NATS) run as containers. Updates via image pulls. Provisioning is a one-liner curl script. |
| 9 | **Offline duration** | **72 hours.** Covers a full weekend tournament series. After 72h offline, warn staff but keep operating. Sync everything on reconnect. |
| 10 | **API style** | **REST + OpenAPI 3.1.** Auto-generated TypeScript client. Universal, debuggable, works with everything. |
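Decision #2 means local state is a pure fold over the event log: every mutation is appended as an event, and the local node rebuilds state by replaying events in order. A toy sketch — the event and state shapes are invented for illustration; real events would flow over NATS JetStream subjects:

```rust
/// Hypothetical tournament events (real ones would carry IDs, timestamps, etc.).
#[derive(Debug)]
enum Event {
    PlayerRegistered { name: String },
    PlayerEliminated { name: String },
}

#[derive(Default, Debug)]
struct TournamentState {
    active_players: Vec<String>,
}

impl TournamentState {
    /// Apply one event to the state.
    fn apply(&mut self, event: &Event) {
        match event {
            Event::PlayerRegistered { name } => self.active_players.push(name.clone()),
            Event::PlayerEliminated { name } => self.active_players.retain(|n| n != name),
        }
    }
}

/// Replaying the full log always yields the same state — the property that
/// gives event-based sync its audit trail and makes offline catch-up a
/// matter of replaying missed events.
fn replay(events: &[Event]) -> TournamentState {
    let mut state = TournamentState::default();
    for e in events {
        state.apply(e);
    }
    state
}

fn main() {
    let log = vec![
        Event::PlayerRegistered { name: "alice".into() },
        Event::PlayerRegistered { name: "bob".into() },
        Event::PlayerEliminated { name: "alice".into() },
    ];
    println!("{:?}", replay(&log).active_players); // ["bob"]
}
```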
## Deferred Questions
These remain open for future consideration:
1. **API versioning strategy**: Maintain backward compatibility as long as possible. Only version on breaking changes. Revisit when approaching first external API consumers.
2. **GraphQL for player-facing app**: REST is sufficient for v1. The player app might benefit from GraphQL's flexible querying later (e.g., "show me my upcoming tournaments across all venues with waitlist status"). **Revisit after v1 launch.**
3. **WebTransport**: When browser support matures, could replace WebSockets for lower-latency real-time streams. **Monitor but do not adopt yet.**
4. **WASM on local node**: Could parts of the frontend run on the local node via WASM for ultra-fast local rendering. **Defer.**
5. **AI features**: Player behavior analytics, optimal table assignments, tournament structure recommendations. The data model should be designed to support future ML pipelines. **Design for it, build later.**