From 27ee0a5813c3a31e55997966dd848605f4297765 Mon Sep 17 00:00:00 2001 From: Mikkel Georgsen Date: Sat, 28 Feb 2026 15:59:25 +0100 Subject: [PATCH] docs: complete project research --- .planning/research/ARCHITECTURE.md | 496 +++++++++++++++++++++++++++++ .planning/research/FEATURES.md | 347 ++++++++++++++++++++ .planning/research/PITFALLS.md | 456 ++++++++++++++++++++++++++ .planning/research/STACK.md | 278 ++++++++++++++++ .planning/research/SUMMARY.md | 263 +++++++++++++++ 5 files changed, 1840 insertions(+) create mode 100644 .planning/research/ARCHITECTURE.md create mode 100644 .planning/research/FEATURES.md create mode 100644 .planning/research/PITFALLS.md create mode 100644 .planning/research/STACK.md create mode 100644 .planning/research/SUMMARY.md diff --git a/.planning/research/ARCHITECTURE.md b/.planning/research/ARCHITECTURE.md new file mode 100644 index 0000000..00e345d --- /dev/null +++ b/.planning/research/ARCHITECTURE.md @@ -0,0 +1,496 @@ +# Architecture Research + +**Domain:** Edge-cloud poker venue management platform (offline-first, three-tier) +**Researched:** 2026-02-28 +**Confidence:** MEDIUM-HIGH (core patterns well-established; NATS leaf node specifics verified via official docs) + +## Standard Architecture + +### System Overview + +``` +┌─────────────────────────────────────────────────────────────────────────────────┐ +│ CLOUD TIER (Core) │ +│ Hetzner Dedicated — Proxmox VE — LXC Containers │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │ +│ │ Go Core API │ │ PostgreSQL │ │NATS JetStream│ │ Authentik │ │ +│ │ (multi- │ │ (venue │ │ (hub cluster│ │ (OIDC IdP) │ │ +│ │ tenant) │ │ agg. 
data) │ │ R=3) │ │ │ │ +│ └──────┬───────┘ └──────────────┘ └──────┬───────┘ └──────────────────┘ │ +│ │ │ │ +│ ┌──────────────┐ ┌──────────────┐ │ │ +│ │ SvelteKit │ │ Netbird │ │ ← mirrors from leaf streams │ +│ │ (public │ │ (WireGuard │ │ │ +│ │ pages, │ │ mesh ctrl) │ │ │ +│ │ admin UI) │ │ │ │ │ +│ └──────────────┘ └──────────────┘ │ │ +└────────────────────────────────────────────────────────────────────────────────┘ + │ WireGuard encrypted tunnel (Netbird mesh) + │ NATS leaf node connection (domain: "leaf-") + │ NetBird reverse proxy (HTTPS → WireGuard → Leaf :8080) + ↓ +┌─────────────────────────────────────────────────────────────────────────────────┐ +│ EDGE TIER (Leaf Node) │ +│ ARM64 SBC — Orange Pi 5 Plus — NVMe — ~€100 │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │ +│ │ Go Leaf API │ │ LibSQL │ │NATS JetStream│ │ SvelteKit │ │ +│ │ (tournament │ │ (embedded │ │ (embedded │ │ (operator UI │ │ +│ │ engine, │ │ SQLite + │ │ leaf node, │ │ served from │ │ +│ │ state mgr) │ │ WAL-based) │ │ local │ │ Leaf) │ │ +│ └──────┬───────┘ └──────────────┘ │ streams) │ └──────────────────┘ │ +│ │ └──────┬───────┘ │ +│ │ WebSocket broadcast │ mirror stream │ +│ │ ↓ (store-and-forward) │ +│ ┌──────────────┐ to Core when online │ +│ │ Hub Manager │ │ +│ │ (client │ │ +│ │ registry, │ │ +│ │ broadcast) │ │ +│ └──────────────┘ │ +└─────────────────────────────────────────────────────────────────────────────────┘ + │ Local WiFi / Ethernet + │ WebSocket (ws:// — LAN only, no TLS needed) + │ Chromium kiosk HTTP polling / WebSocket + ↓ +┌─────────────────────────────────────────────────────────────────────────────────┐ +│ DISPLAY TIER (Display Nodes) │ +│ Raspberry Pi Zero 2W — 512MB RAM — ~€20 each │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │ +│ │ Chromium │ │ Chromium │ │ Chromium │ │ Chromium │ │ +│ │ Kiosk │ │ Kiosk │ │ Kiosk │ │ Kiosk │ │ +│ │ (Clock │ │ (Rankings) │ │ (Seating) │ │ (Signage) │ 
│ +│ │ view) │ │ │ │ │ │ │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────────┘ │ +│ Raspberry Pi OS Lite + X.Org + Openbox + Chromium (kiosk mode, no UI chrome) │ +└─────────────────────────────────────────────────────────────────────────────────┘ + +Player Phones (PWA) ←→ Netbird reverse proxy ←→ Leaf Node + (public HTTPS URL, same URL from any network) +``` + +### Component Responsibilities + +| Component | Responsibility | Implementation | +|-----------|----------------|----------------| +| **Go Leaf API** | Tournament engine, financial engine, state machine, WebSocket hub, REST/WS API for operator + players | Go binary (ARM64), goroutine-per-subsystem, embedded NATS + LibSQL | +| **LibSQL (Leaf)** | Single source of truth for venue state, tournament data, player records | Embedded SQLite via LibSQL driver (`github.com/tursodatabase/go-libsql`), WAL mode | +| **NATS JetStream (Leaf)** | Local pub/sub for in-process events, durable stream for cloud sync, event audit log | Embedded `nats-server` process, domain `leaf-`, local stream mirrored to Core | +| **SvelteKit (from Leaf)** | Operator UI (admin SPA) served by Leaf, player PWA, display views | Static SvelteKit build served by Go's `net/http` or embedded filesystem | +| **Hub Manager (Leaf)** | WebSocket connection registry, broadcast state to all connected clients | Go goroutines + channels; one goroutine per connection, central broadcast channel | +| **Netbird Agent (Leaf)** | WireGuard tunnel to Core, reverse proxy target registration, DNS | Netbird client process, auto-reconnects, handles NAT traversal via STUN/TURN | +| **Go Core API** | Multi-tenant aggregation, cross-venue leagues, player identity, remote API access, cloud-hosted free tier | Go binary (amd64), PostgreSQL with RLS, NATS hub cluster | +| **PostgreSQL (Core)** | Persistent store for aggregated venue data, player profiles, leagues, analytics | PostgreSQL 17+, RLS by `venue_id`, pgx driver in Go | +| **NATS JetStream 
(Core)** | Hub cluster receiving mirrored streams from all leaves, fan-out to analytics consumers | Clustered NATS (R=3), domain `core`, stream sources from all leaf mirrors | +| **Authentik (Core)** | OIDC identity provider for Netbird and Felt operator auth, PIN login fallback | Self-hosted Authentik, ~200MB RAM, Apache 2.0 | +| **Netbird Control (Core)** | Mesh network management plane, policy distribution, reverse proxy routing | Self-hosted Netbird management + signal services | +| **SvelteKit (Core)** | Public venue pages (SSR), admin dashboard, free-tier virtual Leaf UI | SvelteKit with SSR for public pages, SPA for dashboard | +| **Display Nodes** | Render assigned view (clock/rankings/seating/signage) in kiosk browser | Pi Zero 2W + Raspberry Pi OS Lite + X.Org + Openbox + Chromium kiosk | + +## Recommended Project Structure + +``` +felt/ +├── cmd/ +│ ├── leaf/ # Leaf Node binary entrypoint (ARM64 target) +│ │ └── main.go # Boots LibSQL, embedded NATS, HTTP/WS server +│ └── core/ # Core binary entrypoint (amd64 target) +│ └── main.go # Boots PostgreSQL conn, NATS hub, HTTP server +│ +├── internal/ +│ ├── tournament/ # Domain: tournament engine (state machine) +│ │ ├── engine.go # Clock, blinds, levels — pure business logic +│ │ ├── financial.go # Buy-ins, rebuys, prize pool, rake +│ │ ├── seating.go # Table layout, auto-balance, drag-and-drop +│ │ └── events.go # Domain events emitted on state changes +│ ├── player/ # Domain: player management +│ │ ├── registry.go # Player database, registration, bust-out +│ │ └── identity.go # Platform-level identity (belongs to Felt, not venue) +│ ├── display/ # Domain: display node management +│ │ ├── registry.go # Node registration, view assignment +│ │ └── views.go # View types: clock, rankings, seating, signage +│ ├── sync/ # NATS JetStream sync layer +│ │ ├── leaf.go # Leaf-side: publish events, mirror config +│ │ └── core.go # Core-side: consume from leaf mirrors, aggregate +│ ├── ws/ # WebSocket hub +│ │ ├── 
hub.go # Client registry, broadcast channel +│ │ └── handler.go # Upgrade, read pump, write pump +│ ├── api/ # HTTP handlers (shared where possible) +│ │ ├── tournament.go +│ │ ├── player.go +│ │ └── display.go +│ ├── store/ # Data layer +│ │ ├── libsql/ # Leaf: LibSQL queries (sqlc generated) +│ │ └── postgres/ # Core: PostgreSQL queries (sqlc generated) +│ └── auth/ # Auth: PIN offline, OIDC online +│ ├── pin.go +│ └── oidc.go +│ +├── frontend/ # SvelteKit applications +│ ├── operator/ # Operator UI (served from Leaf) +│ ├── player/ # Player PWA (served from Leaf) +│ ├── display/ # Display views (served from Leaf) +│ └── public/ # Public venue pages (served from Core, SSR) +│ +├── schema/ +│ ├── libsql/ # LibSQL migrations (goose or atlas) +│ └── postgres/ # PostgreSQL migrations (goose or atlas) +│ +├── build/ +│ ├── leaf/ # Dockerfile.leaf, systemd units, LUKS scripts +│ └── core/ # Dockerfile.core, LXC configs, Proxmox notes +│ +└── scripts/ + ├── cross-build.sh # GOOS=linux GOARCH=arm64 go build ./cmd/leaf + └── provision-leaf.sh # Flash + configure a new Leaf device +``` + +### Structure Rationale + +- **cmd/leaf vs cmd/core:** Same internal packages, different main.go wiring. Shared domain logic compiles to both targets without duplication. GOARCH=arm64 for leaf, default for core. +- **internal/tournament/:** Pure domain logic with no I/O dependencies. Testable without database or NATS. +- **internal/sync/:** The bridge between domain events and NATS JetStream. Leaf publishes; Core subscribes via mirror. +- **internal/ws/:** Hub pattern isolates WebSocket concerns. Goroutines for each connection; central broadcast channel prevents blocking. +- **schema/libsql vs schema/postgres:** Separate migration paths because LibSQL (SQLite dialect) and PostgreSQL have syntax differences (no arrays, different types). 
## Architectural Patterns

### Pattern 1: NATS JetStream Leaf-to-Core Domain Sync

**What:** Leaf node runs an embedded NATS server with its own JetStream domain (`leaf-`). All state-change events are published to a local stream. Core creates a mirror of this stream using stream source configuration. JetStream's store-and-forward guarantees delivery when the connection resumes after offline periods.

**When to use:** For any state that needs to survive offline periods and eventually reach Core: all tournament events, financial transactions, player registrations.

**Trade-offs:** At-least-once delivery means consumers must be idempotent; message IDs on publish plus deduplication windows on Core resolve this. No ordering guarantees across subjects, but per-subject ordering is preserved.

**Domain configuration (NATS server config on Leaf):**
```hcl
# leaf-node.conf
jetstream {
  domain: "leaf-venue-abc123"
  store_dir: "/data/nats"
}

leafnodes {
  remotes [
    {
      urls: ["nats://core.felt.internal:7422"]
      account: "$G"
    }
  ]
}
```

**Mirror configuration (Core side — creates source from leaf domain):**
```go
// core/sync.go
// js is a jetstream.JetStream obtained via jetstream.New(nc).
js.CreateStream(ctx, jetstream.StreamConfig{
  Name: "VENUE_ABC123_EVENTS",
  Sources: []*jetstream.StreamSource{
    {
      Name:          "VENUE_EVENTS",
      Domain:        "leaf-venue-abc123",
      FilterSubject: "venue.abc123.>",
    },
  },
})
```

### Pattern 2: WebSocket Hub-and-Broadcast for Real-Time Clients

**What:** A central Hub struct in Go holds a map of active connections (operator UI, player PWA, display nodes). State changes trigger a broadcast to the Hub, which writes to each connection's send channel. Per-connection goroutines handle read and write independently, preventing slow clients from blocking others.

**When to use:** Any real-time update that needs to reach all connected clients within 100ms — clock ticks, table state changes, seating updates.

**Trade-offs:** In-process hub is simple and fast.
No Redis pub/sub needed at single-venue scale. A restart drops all connections, and clients must reconnect — which is standard WebSocket behavior.

**Example (Hub pattern):**
```go
type Hub struct {
	clients    map[*Client]bool
	broadcast  chan []byte
	register   chan *Client
	unregister chan *Client
}

func (h *Hub) Run() {
	for {
		select {
		case client := <-h.register:
			h.clients[client] = true
		case client := <-h.unregister:
			// Guard against double close: the client may already have
			// been dropped (and its channel closed) by the broadcast path.
			if _, ok := h.clients[client]; ok {
				delete(h.clients, client)
				close(client.send)
			}
		case message := <-h.broadcast:
			for client := range h.clients {
				select {
				case client.send <- message: // non-blocking
				default: // slow client: drop it rather than block the hub
					close(client.send)
					delete(h.clients, client)
				}
			}
		}
	}
}
```

### Pattern 3: Offline-First with Local-Writes-First

**What:** All writes go to LibSQL (Leaf) first and are confirmed to the client immediately. The LibSQL write triggers a domain event published to the local NATS stream; NATS mirrors the event to Core asynchronously when online. The UI subscribes via WebSocket and sees state changes from the local store — never waiting on the network.

**When to use:** All operational writes: starting the clock, registering a buy-in, busting a player, assigning a table.

**Trade-offs:** Core is eventually consistent with Leaf, not strongly consistent. For operational use (a venue running a tournament), this is the correct trade-off — the venue never waits for the cloud. Cross-venue features (league standings) accept a slight delay.

### Pattern 4: Event-Sourced Audit Trail via JetStream Streams

**What:** NATS JetStream streams are append-only and immutable by default. Every state change (clock pause, player bust-out, financial transaction) is published as an event with a sequence number and timestamp. The stream retains full history, so it doubles as the sync mechanism and the audit log. Current state in LibSQL is the projection of these events.
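Because the stream is also the sync channel, publishing each event with a deterministic message ID is what lets Core's deduplication window absorb at-least-once redelivery. A sketch of the envelope step (the subject follows the `venue.{id}.tournament.{id}.events` scheme used in the data-flow section below; the `Envelope` helper and the ID format are assumptions for illustration):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event mirrors the audit-trail idea: every state change is an
// immutable record with a monotonically increasing local sequence.
// Field names are illustrative, not the actual Felt schema.
type Event struct {
	VenueID      string `json:"venue_id"`
	TournamentID string `json:"tournament_id"`
	Seq          uint64 `json:"seq"`
	Type         string `json:"type"`
}

// Envelope derives the NATS subject and a deterministic message ID.
// Publishing with a Nats-Msg-Id lets JetStream's dedup window drop
// redundant deliveries, making the at-least-once stream idempotent.
func Envelope(ev Event) (subject, msgID string, data []byte, err error) {
	subject = fmt.Sprintf("venue.%s.tournament.%s.events", ev.VenueID, ev.TournamentID)
	msgID = fmt.Sprintf("%s-%s-%d", ev.VenueID, ev.TournamentID, ev.Seq)
	data, err = json.Marshal(ev)
	return
}

func main() {
	subj, id, data, _ := Envelope(Event{VenueID: "abc123", TournamentID: "t42", Seq: 17, Type: "LEVEL_ADVANCED"})
	fmt.Println(subj, id, len(data) > 0) // venue.abc123.tournament.t42.events abc123-t42-17 true
	// With the nats.go jetstream package this would then be published as:
	//   js.Publish(ctx, subj, data, jetstream.WithMsgID(id))
}
```

On the Core side, the stream's `Duplicates` window in its `StreamConfig` bounds how far back this protection reaches; replays older than the window must still be handled idempotently by consumers.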
**When to use:** All state changes that need an audit trail (financial transactions, player registrations, table assignments).

**Trade-offs:** Stream storage grows over time (limit by time or byte size for old tournaments). Projecting current state from events adds complexity on recovery — mitigate with snapshots in LibSQL. Full event history is available in Core for analytics.

### Pattern 5: Display Node View Assignment via URL Parameters

**What:** Each Pi Zero 2W Chromium instance opens a URL like `http://leaf.local:8080/display?view=clock&tournament=abc123`. The Leaf serves the display SvelteKit app. The view is determined by URL parameter, set via the operator UI's node registry. Chromium kiosk mode (no UI chrome) renders full-screen. Changes to view assignment push through WebSocket, triggering client-side navigation.

**When to use:** All display node management — view assignment, content scheduling, emergency override.

**Trade-offs:** URL-based assignment is simple and stateless on the display node, but it requires reliable local WiFi. Pi Zero 2W's 512MB RAM constrains complex Svelte animations; keep display views lightweight (clock, text, simple tables).
+ +## Data Flow + +### Tournament State Change Flow (Operator Action) + +``` +Operator touches UI (e.g., "Advance Level") + ↓ +SvelteKit → POST /api/tournament/{id}/level/advance + ↓ +Go Leaf API handler validates & applies change + ↓ +LibSQL write (authoritative local state update) + ↓ +Domain event emitted: {type: "LEVEL_ADVANCED", level: 5, ...} + ↓ +Event published to NATS subject: "venue.{id}.tournament.{id}.events" + ↓ +NATS local stream appends event (immutable audit log) + ↓ (parallel) +Hub.broadcast ← serialized state delta (JSON) + ↓ +All WebSocket clients receive update within ~10ms + ├── Operator UI: updates clock display + ├── Player PWA: updates blind levels shown + └── Display Nodes: all views react to new state + ↓ (async, when online) +NATS mirror replicates event to Core stream + ↓ +Core consumer processes event → writes to PostgreSQL + ↓ +Aggregated data available for cross-venue analytics +``` + +### Player Phone Access Flow (Online) + +``` +Player scans QR code → browser opens https://venue.felt.app/play + ↓ +DNS resolves to Core (public IP) + ↓ +NetBird reverse proxy (TLS termination at proxy) + ↓ +Encrypted WireGuard tunnel → Leaf Node :8080 + ↓ +Go Leaf API serves SvelteKit PWA + ↓ +PWA opens WebSocket ws://venue.felt.app/ws (proxied via same mechanism) + ↓ +Player sees live clock, blinds, rankings, personal stats +``` + +### Display Node Lifecycle + +``` +Pi Zero 2W boots → systemd starts X.Org → Openbox autostart → Chromium kiosk + ↓ +Chromium opens: http://leaf.local:8080/display?node-id=display-001 + ↓ +Leaf API: lookup node-id in node registry → determine assigned view + ↓ +SvelteKit display app renders assigned view (clock / rankings / seating / signage) + ↓ +WebSocket connection held to Leaf + ↓ +When operator reassigns view → Hub broadcasts view-change event + ↓ +Display SvelteKit navigates to new view (client-side routing, no page reload) +``` + +### Offline → Online Reconnect Sync + +``` +Leaf node was offline (NATS leaf connection 
dropped) + ↓ +Venue continues operating normally (LibSQL is authoritative, NATS local streams work) +All events accumulate in local JetStream stream (store-and-forward) + ↓ +WireGuard tunnel restored (Netbird handles auto-reconnect) + ↓ +NATS leaf node reconnects to Core hub + ↓ +JetStream mirror resumes replication from last sequence number + ↓ +Core processes accumulated events in order (per-subject ordering preserved) + ↓ +PostgreSQL updated with all events that occurred during offline period +``` + +### Multi-Tenant Core Data Model + +``` +Core PostgreSQL: + venues (id, name, netbird_peer_id, subscription_tier, ...) + tournaments (id, venue_id, ...) ← RLS: venue_id = current_setting('app.venue_id') + players (id, felt_user_id, ...) ← platform-level identity (no venue_id) + league_standings (id, league_id, ...) ← cross-venue aggregation +``` + +## Scaling Considerations + +| Scale | Architecture Adjustments | +|-------|--------------------------| +| 1-50 venues (MVP) | Single Core server on Hetzner; NATS single-node or simple cluster; LibSQL on each Leaf is the bottleneck-free read path | +| 50-500 venues | NATS core cluster R=3 is already the design; PostgreSQL read replicas for analytics; SvelteKit public site to CDN | +| 500+ venues | NATS super-cluster across Hetzner regions; PostgreSQL sharding by venue_id; dedicated analytics database (TimescaleDB or ClickHouse for event stream) | + +### Scaling Priorities + +1. **First bottleneck:** Core NATS hub receiving mirrors from many leaves simultaneously. Mitigation: NATS is designed for this — 50M messages/sec benchmarks. Won't be the bottleneck before 500 venues. +2. **Second bottleneck:** PostgreSQL write throughput as event volume grows. Mitigation: NATS stream is the durable store; Postgres writes are async. TimescaleDB for time-series event analytics defers this further. +3. 
**Not a bottleneck:** Leaf Node WebSocket clients — 25,000+ connections on a modest server (the Leaf handles 1 venue, typically 5-50 concurrent clients). + +## Anti-Patterns + +### Anti-Pattern 1: Making Core a Write Path Dependency + +**What people do:** Design operator actions to write to Core (cloud) first, then sync down to Leaf. +**Why it's wrong:** The primary constraint is offline-first. If Core is the write path, any internet disruption breaks the entire operation. +**Do this instead:** Leaf is always the authoritative write target. Core is a read/analytics/aggregation target. Never make Core an operational dependency. + +### Anti-Pattern 2: Shared Database Between Leaf and Core + +**What people do:** Try to use a single LibSQL instance with remote replication as both the Leaf store and Core store. +**Why it's wrong:** LibSQL embedded replication (Turso model) requires connectivity to the remote primary for writes. This violates offline-first. Also: Core needs PostgreSQL features (RLS, complex queries, multi-venue joins) that LibSQL cannot provide. +**Do this instead:** Separate data stores per tier. LibSQL on Leaf (sovereign, offline-capable). PostgreSQL on Core (multi-tenant, cloud-native). NATS JetStream is the replication channel, not the database driver. + +### Anti-Pattern 3: Single Goroutine WebSocket Broadcast + +**What people do:** Iterate over all connected clients in a single goroutine and write synchronously. +**Why it's wrong:** A slow or disconnected client blocks the broadcast for all others. One stale connection delays the clock update for everyone. +**Do this instead:** Hub pattern with per-client send channels (buffered). Use `select` with a `default` case to drop slow clients rather than block. Per-connection goroutines handle writes to the actual WebSocket. + +### Anti-Pattern 4: Storing View Assignment State on Display Nodes + +**What people do:** Configure display views locally on each Pi and SSH in to change them. 
**Why it's wrong:** Requires SSH access to each device. No central management. Adding a new display means physical configuration. It also breaks offline-first if central config is required at boot.
**Do this instead:** Display nodes are stateless. They register with the Leaf by device ID (MAC or serial). Leaf holds the view assignment. Display nodes poll/subscribe for their assignment. Swap a physical Pi without reconfiguration.

### Anti-Pattern 5: Separate Go Codebases for Leaf and Core

**What people do:** Create two independent Go repositories with duplicated domain logic.
**Why it's wrong:** Business logic diverges over time. Bugs fixed in one aren't fixed in the other. Double maintenance burden for a solo developer.
**Do this instead:** Single Go monorepo with shared `internal/` packages. `cmd/leaf/main.go` and `cmd/core/main.go` are the only divergence points — they wire up the same packages with different configuration. `GOOS=linux GOARCH=arm64 go build ./cmd/leaf` produces the Leaf binary.
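The shared-packages point can be made concrete. Compressed into one file for illustration (in the real layout these would be `internal/` packages imported by both entrypoints; the `Config` and `Wire` names are hypothetical), the two `main.go` files reduce to the same wiring call with different configuration:

```go
package main

import "fmt"

// Config is the only thing that differs between the two binaries.
type Config struct {
	Role    string // "leaf" or "core"
	Store   string // "libsql" or "postgres"
	NATSURL string
}

// Wire assembles the shared domain packages with role-specific
// infrastructure; both main.go files would reduce to one such call.
func Wire(cfg Config) string {
	return fmt.Sprintf("%s: store=%s nats=%s", cfg.Role, cfg.Store, cfg.NATSURL)
}

func main() {
	// cmd/leaf/main.go would do roughly:
	fmt.Println(Wire(Config{Role: "leaf", Store: "libsql", NATSURL: "nats://127.0.0.1:4222"}))
	// cmd/core/main.go would do roughly:
	fmt.Println(Wire(Config{Role: "core", Store: "postgres", NATSURL: "nats://core.felt.internal:4222"}))
}
```

A bug fixed in the shared tournament engine then ships to both targets from one commit, which is the whole argument against separate codebases.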
+ +## Integration Points + +### External Services + +| Service | Integration Pattern | Notes | +|---------|---------------------|-------| +| Netbird (WireGuard mesh) | Agent on Leaf connects to self-hosted Netbird management service; reverse proxy configured per-venue | NetBird reverse proxy is beta, requires Traefik as external reverse proxy on Core; test early | +| Authentik (OIDC) | Leaf uses OIDC tokens from Authentik for operator login when online; PIN login as offline fallback | PIN verification against locally cached hash in LibSQL; no Authentik dependency during offline operation | +| NATS JetStream (leaf↔core) | Leaf runs embedded NATS server as leaf node connecting to Core hub over WireGuard | Domain isolation per venue; subjects namespaced `venue..>` | + +### Internal Boundaries + +| Boundary | Communication | Notes | +|----------|---------------|-------| +| Go Leaf API ↔ LibSQL | Direct SQL via `go-libsql` driver (CGo-free driver preferred for cross-compilation) | Use `sqlc` for type-safe query generation; avoid raw string queries | +| Go Leaf API ↔ NATS (local) | In-process NATS client connecting to embedded server (`nats.Connect("nats://127.0.0.1:4222")`) | Publish on every state-change event; Hub subscribes to NATS for broadcast triggers | +| Go Leaf API ↔ WebSocket Hub | Channel-based: API handlers send to `hub.broadcast` channel | Hub runs in its own goroutine; never call Hub methods directly from handlers | +| Go Core API ↔ PostgreSQL | `pgx/v5` driver, `sqlc` generated queries; RLS via `SET LOCAL app.venue_id = $1` in transaction | Row-level security enforced at database layer as defense-in-depth | +| Go Core API ↔ NATS (hub) | Standard NATS client; consumers per-venue mirror stream | Push consumers for real-time processing; durable consumers for reliable at-least-once | +| Leaf ↔ Display Nodes | HTTP (serve SvelteKit app) + WebSocket (state updates) over local LAN | No TLS on local LAN — Leaf and displays are on the same trusted network | +| 
Leaf ↔ Player PWA | HTTP + WebSocket proxied via Netbird reverse proxy | HTTPS at proxy, decrypts, sends over WireGuard to Leaf | + +## Suggested Build Order + +The build order derives from dependency relationships: each layer must be tested before the layer above it depends on it. + +``` +Phase 1: Foundation (Leaf Core + Networking) + 1a. LibSQL schema + Go data layer (sqlc queries, migrations) + 1b. Tournament engine (pure Go, no I/O — state machine logic) + 1c. NATS embedded + local event publishing + 1d. WebSocket Hub (broadcast infrastructure) + 1e. REST + WS API (operator endpoints) + 1f. Netbird agent on Leaf (WireGuard mesh) + 1g. PIN auth (offline) + OIDC auth (online fallback) + ↓ Validates: Offline operation works end-to-end + +Phase 2: Frontend Clients + 2a. SvelteKit operator UI (connects to Leaf API + WS) + 2b. SvelteKit display views (connects to Leaf WS) + 2c. Player PWA (connects to Leaf via Netbird reverse proxy) + ↓ Validates: Real-time sync, display management, player access + +Phase 3: Cloud Sync (Core) + 3a. PostgreSQL schema + RLS (multi-tenant) + 3b. NATS hub cluster on Core + 3c. Leaf-to-Core stream mirroring (event replay on reconnect) + 3d. Go Core API (multi-tenant REST, league aggregation) + 3e. SvelteKit public pages (SSR) + admin dashboard + ↓ Validates: Offline sync, cross-venue features, eventual consistency + +Phase 4: Display Management + Signage + 4a. Display node registry (Leaf API) + 4b. View assignment system (operator sets view per node) + 4c. Pi Zero 2W provisioning scripts (kiosk setup automation) + 4d. Digital signage content system + scheduler + ↓ Validates: Wireless display management at scale + +Phase 5: Authentication + Security Hardening + 5a. Authentik OIDC integration + 5b. LUKS encryption on Leaf (device-level) + 5c. NATS auth callout (per-venue account isolation) + 5d. 
Audit trail validation (event stream integrity checks) +``` + +**Why this order:** +- Leaf foundation must exist before any frontend can connect to it +- Tournament engine logic is the most complex domain; test it isolated before adding network layers +- Cloud sync (Phase 3) is a progressive enhancement — the Leaf works completely without it +- Display management (Phase 4) depends on the WebSocket infrastructure from Phase 1 +- Auth hardening (Phase 5) is last because it can wrap existing endpoints without architectural change + +## Sources + +- [NATS Adaptive Edge Deployment](https://docs.nats.io/nats-concepts/service_infrastructure/adaptive_edge_deployment) — MEDIUM confidence (official NATS docs on leaf node architecture) +- [JetStream on Leaf Nodes](https://docs.nats.io/running-a-nats-service/configuration/leafnodes/jetstream_leafnodes) — MEDIUM confidence (official NATS docs on domain isolation and mirroring) +- [NATS JetStream Core Concepts](https://docs.nats.io/nats-concepts/jetstream) — HIGH confidence (official docs: at-least-once, mirroring, consumer patterns) +- [Synadia: AI at the Edge with NATS JetStream](https://www.synadia.com/blog/ai-at-the-edge-with-nats-jetstream) — LOW confidence (single source, useful patterns) +- [NetBird Reverse Proxy Docs](https://docs.netbird.io/manage/reverse-proxy) — MEDIUM confidence (official Netbird docs; note: beta feature, requires Traefik) +- [LibSQL Embedded Replicas](https://docs.turso.tech/features/embedded-replicas/introduction) — MEDIUM confidence (Turso official docs; embedded replication model) +- [Multi-Tenancy Database Patterns in Go](https://www.glukhov.org/post/2025/11/multitenant-database-patterns/) — LOW confidence (single source, corroborates general PostgreSQL RLS pattern) +- [Raspberry Pi Kiosk System](https://github.com/TOLDOTECHNIK/Raspberry-Pi-Kiosk-Display-System) — LOW confidence (community project, validated approach) +- [Go Cross-Compilation for 
ARM64](https://dev.to/generatecodedev/how-to-cross-compile-go-applications-for-arm64-with-cgoenabled1-188h) — MEDIUM confidence (multiple corroborating sources; CGO complexity for LibSQL noted) +- [Building WebSocket Applications in Go](https://www.videosdk.live/developer-hub/websocket/go-websocket) — LOW confidence (corroborates hub pattern; well-established Go pattern) +- [SvelteKit Service Workers](https://kit.svelte.dev/docs/service-workers) — HIGH confidence (official SvelteKit docs on offline/PWA patterns) + +--- +*Architecture research for: Felt — Edge-cloud poker venue management platform* +*Researched: 2026-02-28* diff --git a/.planning/research/FEATURES.md b/.planning/research/FEATURES.md new file mode 100644 index 0000000..32d1aa7 --- /dev/null +++ b/.planning/research/FEATURES.md @@ -0,0 +1,347 @@ +# Feature Research + +**Domain:** Poker venue management platform (live card rooms, bars, clubs, casinos) +**Researched:** 2026-02-28 +**Confidence:** MEDIUM — Core feature categories verified across multiple competitor products (TDD, Blind Valet, BravoPokerLive, LetsPoker, CasinoWare, kHold'em, TableCaptain). Feature importance is inferred from competitive analysis and forum discussions, not from direct operator interviews. + +--- + +## Feature Landscape + +### Table Stakes (Users Expect These) + +Features that every competing product has. Missing any of these makes Felt feel broken before operators even test differentiating features. 
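Several of the features in this section reduce to small, exactness-critical arithmetic. As one hedged example, prize pool and fixed-percentage payouts might reduce to something like this (the 10% rake and the 50/30/20 split are illustrative inputs, not recommendations; real rake models vary by venue):

```go
package main

import "fmt"

// PrizePool computes the distributable pool from entries, rebuys and
// add-ons after percentage rake. Amounts are whole currency units.
func PrizePool(entries, rebuys, addons, buyIn int, rakePct float64) int {
	gross := (entries + rebuys + addons) * buyIn
	return gross - int(float64(gross)*rakePct)
}

// Payouts splits the pool by fixed percentages (summing to 100), the
// simplest of the structures mentioned (fixed %, ICM, custom splits).
// Integer division drops remainders; real payout code must decide
// where the leftover goes (often folded into first place).
func Payouts(pool int, pcts []int) []int {
	out := make([]int, len(pcts))
	for i, p := range pcts {
		out[i] = pool * p / 100
	}
	return out
}

func main() {
	pool := PrizePool(20, 8, 5, 50, 0.10) // 33 entries-equivalent × €50, 10% rake
	fmt.Println(pool, Payouts(pool, []int{50, 30, 20})) // 1485 [742 445 297]
}
```

Note the rounding loss (742+445+297 = 1484 of 1485): any implementation needs an explicit remainder rule, or disputes at payout time are guaranteed.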
+ +#### Tournament Management + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| Tournament clock (countdown + level display) | Every competitor has it; this is the core job | LOW | Must show level, time remaining, next blind, break info | +| Configurable blind structures | TDD, Blind Valet, CasinoWare all require this; no venue runs default blinds | MEDIUM | Presets + custom editor; include antes, bring-ins | +| Break management (scheduled + manual) | Standard tournament flow; operators must pause for chip-ups, meals | LOW | Break countdown, break message per level | +| Chip-up / denomination removal messaging | Standard tournament procedure; without it operators improvise manually | LOW | Custom message per break (e.g., "Chip up 5s") | +| Rebuy / add-on tracking | Required in almost all bar/club tournaments | MEDIUM | Per-player rebuy count, add-on windows, prize pool impact | +| Late registration window | Industry standard; extends prize pool, increases player counts | LOW | Configurable end-of-level or time-based cutoff | +| Bust-out tracking | Required for payout ordering and seating consolidation | MEDIUM | Player elimination order, timestamp, chip count at bust | +| Prize pool calculation | Operators need this instantly; manual math causes errors | MEDIUM | Rake config, bounties, guaranteed pools, overflows | +| Payout structure (fixed % or variable) | Every serious software has this; without it TDs use paper | MEDIUM | ICM, fixed %, custom splits; print-ready output | +| Player registration / database | Venues know their regulars; re-entry and rebuy require player identity | MEDIUM | Name, contact, history; import from TDD | +| Seating assignment (random + manual) | Required for fair tournament start | MEDIUM | Auto-randomize, drag-and-drop adjustments | +| Table balancing | Required as players bust out; manual is error-prone | HIGH | Auto-suggest moves, seat assignment notifications | +| Table break / 
consolidation | Tables must close as field shrinks | MEDIUM | Triggers at N players, assigns destination seats | +| Multi-tournament support | Venues run satellites alongside mains; bars run concurrent events | HIGH | Independent clocks, financials, player DBs per tournament | +| Pause / resume clock | Universal need; phone calls, disputes, manual delays | LOW | Single button; broadcasts to all displays | +| Hand-for-hand mode | Required at bubble; TDA rules mandate clock management | MEDIUM | Stops timer, 2-3 min per hand deduction | +| Bounty tournament support | Standard format at most venues | MEDIUM | Progressive bounty tracking, per-player bounty calculations | +| Re-entry tournament support | Common format; distinct from rebuy | MEDIUM | New entry = new stack, maintains same player identity | + +#### Cash Game Management + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| Waitlist management | Core cash game operation; BravoPokerLive built entirely on this | MEDIUM | By game type + stakes; order preserved; auto-seat | +| Table status board | Operators and players need to see what games are running | MEDIUM | Open/closed, game type, stakes, seat count, occupied seats | +| Seat-available notification | Players hate watching the board; notification is now expected | MEDIUM | SMS or push; BravoPokerLive popularized this | +| Game type + stakes configuration | Venues run multiple games (NLH, PLO, mixed); must be configurable | LOW | Poker variant, stakes level, min/max buy-in | +| Session tracking (buy-in/cashout) | Required for rake calculation and financial reporting | MEDIUM | Player buy-in amounts, cashout amounts, duration | +| Rake tracking | Venues live on rake; they must see it per table and in aggregate | MEDIUM | Percentage, time charge, or flat fee per pot | +| Must-move table handling | Standard practice at busy rooms with main + must-move games | HIGH | Automated progression rules, priority queue | 
+| Seat change request queue | Standard poker room procedure; without it conflicts arise | LOW | Players queue for preferred seat, FIFO, dealer notification | + +#### Display / Signage + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| Tournament clock display (dedicated screen) | Any software that can't show a big clock on a TV is unusable | MEDIUM | Full-screen view: time, level, blinds/antes, next level | +| Seating board display | Venues show draw results and table assignments on screens | MEDIUM | Table/seat grid, player names, color-coded | +| Rankings / chip count display | Players want to see standings; operators use it to drive engagement | MEDIUM | Sorted leaderboard, update on eliminations | +| Upcoming events / schedule display | Venues run this on idle screens; basic signage need | LOW | Schedule rotation between operational views | +| Multi-screen support | One screen is not enough for a venue | MEDIUM | Each display independently assigned a view | + +#### Player-Facing Access + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| Mobile clock view (player phone) | Blind Valet and LetsPoker both offer this; players expect it now | MEDIUM | QR code access, no app install, live blind/time view | +| Player standings on mobile | Players check their ranking constantly; it reduces questions to staff | MEDIUM | Live rank, chip count if entered, position relative to bubble | +| Waitlist position on mobile | Players want to leave and be notified; board-watching is dying | MEDIUM | Live position number, estimated wait time | + +#### Operational Foundations + +| Feature | Why Expected | Complexity | Notes | +|---------|--------------|------------|-------| +| PIN / role-based authentication | Multi-staff venues need access control | MEDIUM | Floor manager, dealer, observer roles; offline PIN | +| Audit trail (state change log) | Disputes happen; operators need a 
record | MEDIUM | Who did what, when; immutable append-only log | +| Data export (CSV / JSON) | Operators archive results, import to spreadsheets, submit to leagues | LOW | Tournament results, player history, financial summary | +| Offline operation | Venues lose internet; TDD's biggest advantage over cloud-only competitors | HIGH | Full tournament run with zero cloud dependency | +| Print output (registration slips, payout sheets) | Physically handing a player their payout amount is still standard | LOW | Browser print, no special driver required | + +--- + +### Differentiators (Competitive Advantage) + +Features that no single competitor offers completely, or where the gap between current products and user expectations is large enough to win venues. + +| Feature | Value Proposition | Complexity | Notes | +|---------|-------------------|------------|-------| +| Wireless display nodes (no HDMI cables) | Eliminates the #1 physical deployment pain point for venues | HIGH | Pi Zero 2W kiosk over WireGuard mesh; unique to Felt | +| Offline-first edge architecture | Cloud-only products (Blind Valet) die without internet; TDD has no cloud sync | HIGH | Leaf Node runs full tournament autonomously; game changer for reliability | +| Player mobile PWA (no app install) | App installs kill adoption; QR → instant live data is frictionless | MEDIUM | Progressive Web App, works on any phone | +| Integrated digital signage with WYSIWYG editor | Venues pay for separate signage software (Yodeck, NoviSign); Felt combines them | HIGH | Template gallery, AI content generation, venue branding, playlists | +| AI-assisted promo content generation | No competitor offers AI imagery for drink specials, event promos | HIGH | Generates branded images; reduces need for a graphic designer | +| League / season management with point formulas | TDD has leagues; Blind Valet has leagues; but neither integrates with operations | MEDIUM | Configurable scoring, automatic season standings, archive | +| 
Events engine (triggers + actions) | No competitor has this; "when bubble bursts, play sound, change display, send message" | HIGH | Rule-based automation; webhook support; unlocks countless automations | +| Cross-venue platform identity (player belongs to Felt, not venue) | BravoPokerLive does discovery; no one does cross-venue player identity for non-casinos | HIGH | Network effects; players carry their history across venues | +| TDD data import | Critical for adoption; operators have years of data in TDD | MEDIUM | Import blind structures, player DB, tournament history, league records | +| Touch-native dark-room operator UI | TDD's UI is 2002-era; all competitors have mediocre UX | HIGH | Mobile-first, one-handed operation in a dark card room | +| Dealer tablet module (bust-outs / rebuys at table) | LetsPoker has this; no other competitor does for live venues | MEDIUM | Reduces floor staff movement; improves accuracy | +| Multi-room / simultaneous tournament management | Casinos need this; no small-venue product handles it | HIGH | Independent rooms under one operator account | +| Sound engine (level-up sounds, break alerts, bubble fanfare) | TDD has customizable sounds; most cloud products don't | LOW | Per-event sound mapping; controllable from operator UI | +| Sponsor ad rotation on signage | Venues monetize signage; no poker software does this natively | LOW | Ad slot in signage playlist with scheduling | +| Rake analytics dashboard | Most products track rake but don't visualize it usefully | MEDIUM | Per-table, per-game, per-session revenue reporting | +| Player notifications (seat, waitlist, next tournament) | Push/SMS when seat ready; LetsPoker does this; most don't | MEDIUM | Seat available, waitlist position change, event reminder | +| Venue public presence page | PokerAtlas does discovery; Felt can combine venue management + public presence | MEDIUM | Public schedule, event registration, venue info | + +--- + +### Anti-Features (Commonly Requested, 
Often Problematic) + +| Feature | Why Requested | Why Problematic | Alternative | +|---------|---------------|-----------------|-------------| +| Payment processing / chip cashout | "Complete the money loop in one system" | Gambling license requirements, PCI compliance, massive regulatory complexity; Felt is a management tool not a payment processor | Track amounts, integrate with existing cage/cashier workflow; show totals clearly | +| Online poker gameplay | "Run online tourneys too" | Entirely different product (RNG, real-money gambling regulation, anti-fraud); would consume years of development for a different market | Keep focus on live venues; out of scope per PROJECT.md | +| Video streaming / broadcast | "Stream our home game" | Different infrastructure (CDN, video encoding, latency constraints); adds enormous complexity for a niche use case | Partner with OBS/Streamlabs if needed; do not build | +| Crypto / blockchain payments | "Accept ETH for buy-ins" | Volatile value, regulatory uncertainty, operational complexity for floor staff; 2024 trend but wrong for this market | Cash is still king in live poker; not needed | +| Real-money gambling regulation features | "We need GRC built in" | Each jurisdiction has different requirements; would require constant legal maintenance | Operate as a management tool; compliance is venue's responsibility | +| BYO hardware support | "Can I run it on my old laptop?" 
| Support burden becomes enormous; hardware variance causes reliability issues; undermines offline guarantees | Ship pre-configured Leaf Nodes; free tier runs on Felt cloud | +| Separate mobile apps (iOS/Android) for phase 1-3 | "Native app feels better" | Play Store / App Store review cycles slow iteration; PWA covers 95% of use cases for this domain | PWA for players; native apps are Phase 4 only | +| Multi-currency / regional financial rules | "Support Euro AND GBP" | Display-only currency is fine; actual financial rule compliance per jurisdiction is a legal minefield | Show currency symbol as configuration; no financial calculation changes | +| Real-time chip count entry by all players | "Players should enter their own chip counts" | Cheating surface; operational chaos; floor staff integrity is paramount | Optional chip count entry by floor staff only with audit trail | +| Complex tournament scheduling / calendar system | "Let players register online for next month's events" | Phase 3 feature only; in phase 1-2, the complexity diverts from core operational tools | Static event schedule display in phase 1; online registration in phase 3 | +| Third-party casino management system (CMS) integration | "Connect to IGT/Bally/Aristocrat" | Enterprise sales cycle, NDA/API access requirements, proprietary protocols; not needed for target market | Target venues are non-casino; casinos use Bravo/ACSC anyway | +| Staking / backing / action splitting | "Track who owns % of players" | Legal complexity, financial tracking burden, scope far beyond venue operations | Out of scope; operators who want this use dedicated tools | + +--- + +## Feature Dependencies + +``` +[Player Database] + └──requires──> [Tournament Registration] + └──requires──> [Tournament Clock Engine] + └──requires──> [Blind Structure Config] + +[Table Balancing] + └──requires──> [Seating Assignment] + └──requires──> [Player Database] + +[Waitlist Management] (cash game) + └──requires──> [Player Database] + 
└──requires──> [Table Status Board] + +[Display System] + └──requires──> [Display Node Registry] + └──requires──> [Tournament Clock Engine] (for tournament views) + └──requires──> [Table Status Board] (for cash game views) + +[Player Mobile PWA] + └──requires──> [Tournament Clock Engine] (for live data) + └──requires──> [Waitlist Management] (for position tracking) + └──requires──> [Netbird reverse proxy] (for external access) + +[League Management] + └──requires──> [Tournament results export / history] + └──requires──> [Player Database] + +[Events Engine] + └──requires──> [Tournament Clock Engine] (trigger source) + └──requires──> [Display System] (action target) + +[Dealer Tablet Module] + └──requires──> [Tournament Registration] + └──requires──> [Bust-out Tracking] + └──requires──> [Rebuy/Add-on Tracking] + +[Digital Signage / Promo Content] + └──requires──> [Display Node Registry] + └──enhances──> [Events Engine] (content triggers) + +[Analytics / Revenue Reporting] + └──requires──> [Rake Tracking] (cash games) + └──requires──> [Tournament Financial Engine] (tournaments) + └──requires──> [Session Tracking] + +[NATS Sync / Offline Replay] + └──requires──> [Leaf Node embedded NATS] + └──enhances──> [All real-time features] (propagates state to clients) + +[Loyalty / Points System] + └──requires──> [Player Database] + └──requires──> [Session Tracking] + └──requires──> [Tournament Registration] + +[Public Venue Presence] + └──requires──> [Core cloud layer] + └──requires──> [Player Database] (cross-venue identity) +``` + +### Dependency Notes + +- **Tournament Clock requires Blind Structure Config:** You cannot run a clock without a structure. Build the structure editor before the clock UI. +- **Table Balancing requires Seating Assignment:** Balancing suggests moves; you need an assigned seating model first. +- **Player Mobile PWA requires Netbird reverse proxy:** Players accessing from outside the venue LAN need the reverse proxy tunnel. 
Must be available from day 1 of mobile features. +- **Dealer Tablet requires Bust-out + Rebuy tracking:** The tablet is just an input surface for existing state; build the state machine first. +- **Events Engine enhances almost everything:** Build core features without it first; add Events Engine as an automation layer on top. +- **Digital Signage conflicts with Lean MVP scope:** Signage is a differentiator but adds significant complexity (content editor, playlist, scheduling). Build operational displays first; WYSIWYG editor is phase 1 stretch, AI content generation is phase 2+. +- **League Management requires completed tournament results:** You cannot compute standings without a completed tournament history model. League features must come after the core tournament lifecycle is complete. + +--- + +## MVP Definition + +### Launch With (v1) — Tournament Operations Core + +Minimum viable product for a venue to replace TDD + a whiteboard and run a complete tournament with Felt. + +- [ ] Tournament clock engine (countdown, levels, breaks, pause/resume) — core job +- [ ] Blind structure configuration (custom + presets, antes, chip-up messages) — cannot run without +- [ ] Player registration + bust-out tracking — needed for payout and table management +- [ ] Rebuy / add-on / late registration — needed for 90% of real tournaments +- [ ] Prize pool calculation + payout structure — operators need this to pay out correctly +- [ ] Table seating assignment (random + manual) + table balancing — required for multi-table events +- [ ] Financial engine (buy-ins, rake, bounties) — venues need to track cash flow +- [ ] Display system: clock view + seating view on dedicated screens — the room needs to see the clock +- [ ] Player mobile PWA: live clock + blinds + personal rank — replaces asking the floor every 2 minutes +- [ ] Offline-first operation (zero internet dependency during tournament) — reliability requirement +- [ ] Role-based auth (operator PIN offline, OIDC online) — 
floor staff access control +- [ ] TDD data import (blind structures + player DB) — adoption enabler for existing TDD users +- [ ] Data export (CSV, JSON, HTML print output) — venues archive results, submit to leagues + +### Add After Validation (v1.x) — Tournament Enhancement + Cash Game Foundations + +Add once the core tournament loop is proven reliable in production. + +- [ ] League / season management — triggered when venues start asking for standings tracking +- [ ] Hand-for-hand mode — needed before any venue runs a serious event with a bubble +- [ ] Events engine (sound triggers, view changes on level-up, break start) — high value, low user cost +- [ ] Digital signage (schedule/promo display between operational views) — venues want their idle screens working +- [ ] Waitlist management + table status board — prerequisite for cash game operations +- [ ] Session tracking (buy-in / cashout / duration) — cash game financial foundation +- [ ] Rake tracking per table — cash game revenue visibility +- [ ] Seat-available notifications (push / SMS) — player retention for cash games +- [ ] Must-move table logic — required at any venue with main game protection +- [ ] Dealer tablet module (bust-outs + rebuys at table) — reduces floor staff walking + +### Future Consideration (v2+) — Platform Maturity + +Defer until core product-market fit is established across tournament and cash game operations. 
+ +- [ ] WYSIWYG content editor + AI promo generation — Phase 1 is out-of-the-box templates; WYSIWYG editor is Phase 2 after operational features are solid +- [ ] Dealer management (scheduling, rotations, clock-in) — Phase 3; operators won't churn over this +- [ ] Player loyalty / points system — Phase 3; high complexity, needs player identity first +- [ ] Public venue presence page + online event registration — Phase 3; requires stable Core cloud layer +- [ ] Analytics dashboards (revenue, player retention, game popularity) — Phase 3; operators need basic reporting first +- [ ] Private venues + membership management — Phase 3; niche use case (private clubs) +- [ ] Native iOS / Android apps (player + operator) — Phase 4; PWA covers the use case until then +- [ ] Social features (friends, activity feed, achievements) — Phase 4; requires large player network first +- [ ] Cross-venue player leaderboards — Phase 4; requires critical mass of venues on the platform + +--- + +## Feature Prioritization Matrix + +| Feature | User Value | Implementation Cost | Priority | +|---------|------------|---------------------|----------| +| Tournament clock engine | HIGH | LOW | P1 | +| Blind structure config | HIGH | LOW | P1 | +| Player registration + bust-out | HIGH | MEDIUM | P1 | +| Rebuy / add-on | HIGH | MEDIUM | P1 | +| Payout calculation + structure | HIGH | MEDIUM | P1 | +| Table seating + balancing | HIGH | HIGH | P1 | +| Display system (clock + seating views) | HIGH | HIGH | P1 | +| Player mobile PWA (clock + rank) | HIGH | MEDIUM | P1 | +| Offline-first operation | HIGH | HIGH | P1 | +| TDD data import | HIGH | MEDIUM | P1 | +| Financial engine (buy-ins, rake) | HIGH | MEDIUM | P1 | +| Role-based auth (PIN + OIDC) | HIGH | MEDIUM | P1 | +| Waitlist management | HIGH | MEDIUM | P2 | +| Cash game table status board | HIGH | MEDIUM | P2 | +| Session + rake tracking | HIGH | MEDIUM | P2 | +| Events engine | MEDIUM | HIGH | P2 | +| League management | MEDIUM | MEDIUM | P2 | 
| Dealer tablet module | MEDIUM | MEDIUM | P2 |
| Hand-for-hand mode | HIGH | LOW | P2 |
| Seat-available notifications | MEDIUM | MEDIUM | P2 |
| Must-move logic | MEDIUM | HIGH | P2 |
| Digital signage (playlist + schedule) | MEDIUM | HIGH | P2 |
| WYSIWYG content editor | MEDIUM | HIGH | P3 |
| AI content generation | LOW | HIGH | P3 |
| Dealer management / scheduling | MEDIUM | HIGH | P3 |
| Loyalty points system | MEDIUM | HIGH | P3 |
| Public venue presence | MEDIUM | HIGH | P3 |
| Analytics dashboards | MEDIUM | HIGH | P3 |
| Native mobile apps | LOW | HIGH | P3 |
| Social / friend features | LOW | HIGH | P3 |

**Priority key:**
- P1: Must have for any operator to run a tournament with Felt
- P2: Must have to expand beyond early adopters and cover cash game venues
- P3: Must have to compete at enterprise / casino tier

---

## Competitor Feature Analysis

| Feature | TDD | Blind Valet | BravoPokerLive | LetsPoker | Felt (planned) |
|---------|-----|-------------|----------------|-----------|----------------|
| Tournament clock | Yes (deep) | Yes (basic) | No | Yes | Yes (deep) |
| Blind structure editor | Yes (deep) | Yes | No | Yes | Yes (deep + presets) |
| Player registration | Yes | Yes (basic) | Player-side only | Yes | Yes |
| Rebuy / add-on | Yes | Yes | No | Yes | Yes |
| Bust-out tracking | Yes | Yes | No | Yes | Yes |
| Table balancing | Yes | No | No | Yes | Yes |
| Multi-tournament | Yes (limited) | No | No | Yes | Yes |
| Payout calculator | Yes | Yes | No | Yes | Yes |
| League management | Yes | Yes | No | Yes | Yes |
| Display (TV clock) | Yes (HDMI) | Browser | No | Yes (browser) | Yes (wireless nodes) |
| Player mobile access | No | Yes | Yes (waitlist) | Yes | Yes (PWA) |
| Waitlist management | No | No | Yes (core feature) | Yes | Yes |
| Cash game table board | No | No | Yes | Yes | Yes |
| Session / rake tracking | No | No | No | Partial | Yes |
| Must-move tables | No | No | Partial | No | Yes |
+| Offline operation | Yes (Windows only) | No | No | No | Yes (ARM SBC) | +| Cloud sync | No | Yes | Yes | Yes | Yes (NATS) | +| Cross-platform | Windows only | Browser | iOS/Android | All | Browser / PWA | +| Digital signage | No | No | No | No | Yes (integrated) | +| Events engine | Yes (scripts) | No | No | No | Yes | +| AI content generation | No | No | No | No | Yes (planned) | +| Dealer tablet module | No | No | No | Yes | Yes (Phase 2) | +| Loyalty system | No | No | No | No | Yes (Phase 3) | +| Dealer management | No | No | No | No | Yes (Phase 3) | +| Analytics dashboard | Minimal | No | Partial (player app) | Partial | Yes (Phase 3) | +| Import from TDD | N/A | No | No | No | Yes | +| Wireless displays | No | No | No | No | Yes (unique) | +| Platform player identity | No | No | Partial (national) | Partial | Yes (cross-venue) | + +--- + +## Sources + +- [The Tournament Director](https://www.thetournamentdirector.net/) — feature research (403 on direct fetch; confirmed via WebSearch and community forums) +- [Blind Valet](https://blindvalet.com/) — features page fetched directly +- [BravoPokerLive App Store](https://apps.apple.com/us/app/bravopokerlive/id470322257) — feature descriptions +- [LetsPoker](https://lets.poker/) — confirmed via [PokerNews LetsPoker article](https://www.pokernews.com/news/2022/02/letspoker-app-looks-to-revolutionize-the-world-of-live-poker-40689.htm) and [LetsPoker features article](https://lets.poker/articles/letspoker-features/) +- [CasinoWare](https://www.casinoware.net/) — features fetched directly +- [kHold'em](https://www.kholdem.net/en/) — features fetched directly +- [PokerAtlas TableCaptain](https://www.pokeratlas.com/info/table-captain) — confirmed via WebSearch (403 on direct fetch) +- [PokerNews: Best Poker Table Management Software](https://www.pokernews.com/strategy/what-is-the-best-poker-table-management-software-48189.htm) — category overview +- [Poker Chip Forum: What tournament management solution are you 
using?](https://www.pokerchipforum.com/threads/what-tournament-management-solution-are-you-using.106524/) — operator pain points (403 on direct fetch; confirmed via WebSearch) +- [Poker Chip Forum: Not impressed with tournament director](https://www.pokerchipforum.com/threads/not-really-impressed-with-tournament-director-software-any-interest-in-this.69632/) — TDD limitations +- [Home Poker Tourney: Clock Features Chart](https://homepokertourney.org/clocks-chart.htm) — feature comparison matrix fetched directly +- [3UP Gaming: Compare Poker Tournament Software](https://www.3upgaming.com/blog/compare-the-best-poker-tournament-software-providers/) — provider comparison fetched directly +- [Technology.org: Optimizing Poker Room Operations](https://www.technology.org/2025/01/23/optimizing-poker-room-operations-with-advanced-software/) — operational features (403 on direct fetch; confirmed via WebSearch) +- [3UP Gaming: Poker Room Waiting List App](https://www.3upgaming.com/blog/poker-room-waiting-list-app-the-best-tools-to-manage-your-poker-games) — waitlist feature analysis + +--- +*Feature research for: poker venue management platform (Felt)* +*Researched: 2026-02-28* diff --git a/.planning/research/PITFALLS.md b/.planning/research/PITFALLS.md new file mode 100644 index 0000000..8c5a740 --- /dev/null +++ b/.planning/research/PITFALLS.md @@ -0,0 +1,456 @@ +# Pitfalls Research + +**Domain:** Edge-cloud poker venue management platform (offline-first, ARM64 SBC, real-time display sync, multi-tenant SaaS) +**Researched:** 2026-02-28 +**Confidence:** MEDIUM-HIGH (critical pitfalls verified against multiple sources; some domain-specific items from training data flagged) + +--- + +## Critical Pitfalls + +### Pitfall 1: NATS JetStream Data Loss Under Default fsync Settings + +**What goes wrong:** +NATS JetStream's default `sync_interval` is 2 minutes — meaning acknowledged messages are not guaranteed to be on disk before the ACK is sent to the client. 
A kernel crash, power loss, or sudden SBC shutdown on the Leaf node can result in losing messages that were already ACK'd by the broker. The December 2025 Jepsen analysis of NATS 2.12.1 confirmed this: "NATS JetStream can lose data or get stuck in persistent split-brain in response to file corruption or simulated node failures." Even a single power failure can trigger loss of committed writes. + +**Why it happens:** +The lazy fsync default prioritizes throughput over durability. This is the correct tradeoff for most messaging workloads. However, tournament state sync is financial-adjacent: a lost "player busted" or "rebuy processed" event corrupts the prize pool ledger and produces incorrect final payouts. Developers assume "acknowledged" means "durable." + +**How to avoid:** +Set `sync_interval: always` on the embedded NATS server configuration on the Leaf node. This makes NATS fsync before every acknowledgement. Accept the throughput reduction — tournament events are low-frequency (< 100 events/minute) so this has zero practical impact. Verify this setting is in the embedded server config before first production deploy. + +```yaml +# nats-server.conf for Leaf +jetstream { + store_dir: /var/lib/felt/jetstream + sync_interval: always +} +``` + +**Warning signs:** +- Default config template copied from NATS quickstart docs (which do not set `sync_interval`) +- Tournament events not replaying correctly after SBC reboot in testing +- "Event count mismatch" after power-cycle tests + +**Phase to address:** Phase 1 (NATS JetStream integration) — bake this into the embedded server bootstrap config, not as a post-launch fix. + +--- + +### Pitfall 2: LibSQL/SQLite WAL Checkpoint Stall Causing Write Timeouts + +**What goes wrong:** +SQLite in WAL mode accumulates changes in the WAL file until a checkpoint copies them back to the main database. 
Under sustained write load (active tournament with frequent state updates), the WAL file can grow unbounded if readers hold long transactions — preventing checkpoint from completing. When a checkpoint finally runs (FULL or RESTART mode), it briefly blocks all writers, causing the tournament clock update goroutine to queue up and miss a 1-second tick. Separately, running LibSQL's sync operation while the local WAL is being actively written risks data corruption (documented in LibSQL issue #1910). + +**Why it happens:** +Developers set WAL mode and consider concurrency "solved." They miss that: (1) WAL autocheckpoint defaults to PASSIVE mode which skips when readers are present, (2) uncontrolled WAL growth degrades read performance as SQLite must scan further into WAL history, and (3) LibSQL sync must not overlap with active write transactions. + +**How to avoid:** +- Set `PRAGMA wal_autocheckpoint = 0` to disable automatic checkpointing, then schedule explicit `PRAGMA wal_checkpoint(TRUNCATE)` during quiet periods (e.g., break periods in tournament, level transitions). +- Set `PRAGMA journal_size_limit = 67108864` (64MB) to cap WAL file size. +- Never initiate LibSQL cloud sync during an active database write transaction — gate sync on a mutex with the write path. +- Use `PRAGMA busy_timeout = 5000` to avoid immediate failures when contention occurs. + +**Warning signs:** +- Tournament clock drifting by 1-2 seconds during heavy rebuy periods +- WAL file growing past 50MB during long tournaments +- LibSQL sync errors logged during high-activity periods + +**Phase to address:** Phase 1 (database layer setup) — configure these pragmas in the database initialization code path, not as tuning afterthoughts. + +--- + +### Pitfall 3: Offline-First Sync Conflict Producing Incorrect Prize Pool Ledger + +**What goes wrong:** +The Leaf node operates offline. 
If an operator on the venue tablet and a player viewing their phone PWA both initiate actions that affect the same tournament state (e.g., operator marks player as busted while player is registering a rebuy via PWA), conflicting events arrive at the NATS stream in undefined order when connectivity resumes. A Last-Write-Wins or timestamp-based merge on financial records corrupts the prize pool: a rebuy that was processed offline gets silently dropped, and the player is paid out less than they are owed.

**Why it happens:**
Developers treat all offline sync as equal. Financial ledger mutations (buy-ins, rebuys, payouts) are not idempotent by default. The LibSQL sync "last-push-wins" default is dangerous for financial records where all writes must be applied in the correct order. Timestamp-based ordering fails on SBCs where system clocks can drift.

**How to avoid:**
- Treat financial transactions as an append-only event log, never as mutable rows. Each buy-in, rebuy, add-on, and payout is an immutable event with a monotonic sequence number assigned by the Leaf node.
- Never use wall clock timestamps to order conflicting financial events — use Lamport clocks or NATS sequence numbers as the canonical ordering.
- The prize pool balance is always derived from the event log (computed, never stored directly), so a replay of all events always produces the correct total.
- Mark financial events as requiring explicit human conflict resolution (surface a UI alert) rather than auto-merging in the rare case of genuine conflicts.

**Warning signs:**
- Prize pool total doesn't match sum of individual player buy-in records
- Rebuy count in player history differs from rebuy count in prize pool calculation
- Sync error logs showing sequence number gaps in NATS stream replay

**Phase to address:** Phase 1 (financial engine design) — must be an architectural decision, not retrofittable.
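
The event-log shape described above can be sketched in a few lines of Go. Everything here is illustrative: the `Event` struct, its field names, and the `PrizePoolCents` helper are hypothetical, and a production version would take `Seq` from the Leaf's NATS stream sequence rather than assigning it locally.

```go
package main

import "fmt"

// EventType tags immutable financial events; the log is append-only.
type EventType string

const (
	BuyIn  EventType = "buy_in"
	Rebuy  EventType = "rebuy"
	AddOn  EventType = "add_on"
	Payout EventType = "payout"
)

// Event is never mutated after append. Seq is a monotonic sequence number
// assigned by the Leaf node (or the NATS stream), never a wall-clock time.
type Event struct {
	Seq         uint64
	Type        EventType
	PlayerID    string
	AmountCents int64 // smallest currency unit, never float64
}

// PrizePoolCents derives the pool balance by replaying the log. The balance
// is computed, never stored, so a full replay always yields the same total.
// It also reports the first gap in sequence numbers (0 = no gap), which is
// the "sequence number gaps" warning sign listed above.
func PrizePoolCents(events []Event) (total int64, firstGap uint64) {
	var prev uint64
	for _, e := range events {
		if prev != 0 && e.Seq != prev+1 && firstGap == 0 {
			firstGap = prev + 1 // a lost event; surface for human resolution
		}
		prev = e.Seq
		switch e.Type {
		case BuyIn, Rebuy, AddOn:
			total += e.AmountCents
		case Payout:
			total -= e.AmountCents
		}
	}
	return total, firstGap
}

func main() {
	log := []Event{
		{Seq: 1, Type: BuyIn, PlayerID: "p1", AmountCents: 5500},
		{Seq: 2, Type: Rebuy, PlayerID: "p1", AmountCents: 5500},
		{Seq: 4, Type: BuyIn, PlayerID: "p2", AmountCents: 5500}, // Seq 3 was lost in sync
	}
	total, gap := PrizePoolCents(log)
	fmt.Printf("pool=%d cents, first missing seq=%d\n", total, gap)
}
```

Deriving rather than storing the balance means the two warning-sign checks above (pool vs. sum of buy-ins, gap detection) fall out of the same replay.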

---

### Pitfall 4: Pi Zero 2W Memory Exhaustion Crashing Display Node

**What goes wrong:**
The Raspberry Pi Zero 2W has 512MB RAM. Chromium running a full WebGL/Canvas display (animated tournament clock, live chip counts, scrolling rankings) can consume 300-400MB alone, leaving minimal headroom. Memory pressure causes the kernel OOM killer to terminate Chromium mid-display. Without a proper watchdog and restart mechanism, the display node goes dark silently — venue staff don't notice until a player complains. Over multi-hour tournaments, memory leaks in JavaScript (uncollected WebSocket message handlers, accumulated DOM nodes) compound this.

**Why it happens:**
Development happens on a desktop or full Raspberry Pi 4 with 4-8GB RAM. The Zero 2W constraint is not felt until hardware testing. Display views are designed with visual richness in mind, without profiling memory consumption on the target hardware. WebSocket reconnection handlers that fail to deregister previous listeners create unbounded listener growth.

**How to avoid:**
- Test ALL display views on actual Pi Zero 2W hardware from day one, not just functionality but memory usage (use `chrome://memory-internals` or external monitoring).
- Cap the V8 heap with `--js-flags=--max-old-space-size=256` (note that `--max-old-space-size` is a V8 flag and must be passed through `--js-flags`; Chromium ignores it as a bare switch) and set `--memory-pressure-off` — counterintuitively, disabling pressure signals can prevent thrashing.
- Enable `zram` on the Pi Zero 2W (compressed swap) — adds ~200MB effective memory, documented to make Chromium "usable" on constrained devices.
- Implement a kiosk watchdog service (systemd `Restart=always` + `MemoryMax=450M`) that restarts Chromium if it exceeds memory limits.
- Use Server-Sent Events (SSE) instead of WebSocket for display-only views — reduces connection overhead and eliminates bidirectional state machine complexity where one-way push is sufficient.
- Implement manual listener cleanup in all WebSocket event handlers: always call `removeEventListener` in cleanup functions.
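
The watchdog bullet above can be written as a small systemd unit. This is a sketch under assumptions: the unit name `felt-kiosk.service`, the Chromium binary path, and the Leaf display URL (port 8080, per the architecture diagram) are placeholders, and the X/Wayland session environment a kiosk needs is omitted.

```ini
# /etc/systemd/system/felt-kiosk.service (hypothetical unit name)
[Unit]
Description=Felt display kiosk (Chromium)
After=graphical.target

[Service]
ExecStart=/usr/bin/chromium-browser --kiosk --memory-pressure-off \
  --js-flags=--max-old-space-size=256 http://leaf.local:8080/display
Restart=always
RestartSec=5
# Hard memory cap below the Pi Zero 2W's 512MB: if Chromium leaks past this,
# the kernel kills the cgroup and systemd restarts the kiosk automatically.
MemoryMax=450M

[Install]
WantedBy=graphical.target
```

`Restart=always` is what turns an OOM kill from a dark screen into a few seconds of blank display. Note that on Raspberry Pi OS the memory cgroup may need to be enabled via `cgroup_enable=memory` in `cmdline.txt` for `MemoryMax=` to take effect.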
+ +**Warning signs:** +- Display works fine for 30 minutes, then goes blank +- `dmesg` on Pi Zero shows `oom-kill` entries +- Memory usage climbs monotonically over tournament duration when profiling + +**Phase to address:** Phase 1 (display node MVP) — must validate on target hardware before building more display views. + +--- + +### Pitfall 5: Tournament Table Rebalancing Algorithm Producing Unfair or Invalid Seating + +**What goes wrong:** +Table rebalancing when players bust out is operationally critical and algorithmically subtle. Common failures: (1) moving a player who just posted their blind, creating a situation where they post blind twice before being able to act; (2) breaking a table that still has enough players to stay open; (3) choosing a player to move who has the dealer button (invalid in most rule sets); (4) the algorithm enters an infinite loop when there is no valid move (e.g., exactly balanced tables that can't be further balanced without breaking one). The Tournament Director software has a known bug where "if the dealer button is set to a non-valid seat, a table balance can cause the application to lock-up." + +**Why it happens:** +Developers implement the "move player from biggest table to smallest table" happy path, then discover edge cases through production tournaments. Poker TDA rules for balancing are complex and context-dependent. The interaction between dealer button position, blind positions, and move eligibility is not obvious. + +**How to avoid:** +- Implement rebalancing as a pure function that takes complete tournament state and returns a list of moves — enables exhaustive unit testing with edge cases. +- Consult the Poker TDA Rules (2024 edition) as the authoritative reference for rebalancing procedures (Rules 25-28 cover table balancing and player movement). +- Test edge cases explicitly: single player remaining, two players at same table count, dealer button at last seat, player in small blind position being the move candidate. 
+- Expose a "dry run" rebalancing mode in the UI that shows proposed moves before executing — operators can catch bad suggestions before players are physically moved. +- Never auto-apply rebalancing; always require operator confirmation. + +**Warning signs:** +- Players complaining about double-posting blinds after a move +- Tournament stuck unable to proceed after table break +- Rebalancing suggestion moving the dealer button holder + +**Phase to address:** Phase 1 (seating engine) — core algorithm must be implemented and unit tested exhaustively before tournament testing. + +--- + +### Pitfall 6: Multi-Tenant RLS Policy Leaking Venue Data at Application Layer + +**What goes wrong:** +PostgreSQL Row Level Security on Core provides database-level tenant isolation, but RLS has a critical operational failure mode: if the application sets the `app.tenant_id` session variable (the common pattern for RLS) using a connection pool that reuses connections between requests, a previous request's tenant ID can bleed into the next request's session. This is a documented thread-safety issue — "some requests were being authorized with a previous request's user id because the user id for RLS was being stored in thread-local storage and threads were being reused for requests." + +**Why it happens:** +RLS tutorials typically show setting `SET LOCAL app.tenant_id = $1` inside a transaction, which is safe. But connection pools that don't reset session state between checkouts, or code that uses `SET` instead of `SET LOCAL`, cause session variables to persist across requests. In a multi-tenant venue management platform, this means venue A could see venue B's player data. + +**How to avoid:** +- Always use `SET LOCAL app.tenant_id = $1` (transaction-scoped), never `SET app.tenant_id` (session-scoped). +- Use a connection pool that explicitly resets session state on checkout (pgBouncer in transaction mode is safer than session mode for this pattern). 
+- Add an application-layer assertion: before every query, verify that `current_setting('app.tenant_id')` matches the expected tenant from the JWT/session — log and reject any mismatch as a security event. +- Write integration tests that explicitly test cross-tenant isolation: authenticate as venue A, attempt to query venue B's data through all API endpoints. + +**Warning signs:** +- Flaky test failures where tenant data appears for wrong user (intermittent = session state bleed) +- Query results contain more rows than expected for a given venue's player count +- Connection pool configured in session mode (pgBouncer `pool_mode = session`) + +**Phase to address:** Phase 1 (Core backend data model) — RLS policies and the tenant isolation integration tests must be built before any multi-venue feature. + +--- + +### Pitfall 7: Financial Calculations Using Floating-Point Arithmetic + +**What goes wrong:** +Prize pool calculations, rake, bounties, and payout distributions calculated using `float64` accumulate representation errors. `0.1 + 0.2 != 0.3` in IEEE 754. A tournament with 127 players at €55 buy-in with 10% rake, split across 12 payout positions with percentage-based distribution, will produce cent-level errors that cascade: the sum of individual payouts does not equal the prize pool total, creating a money reconciliation error. In a live venue, this is a regulatory and reputational risk. + +**Why it happens:** +Go's `float64` feels precise enough for small numbers. Developers write `prizePool := float64(buyIns) * 0.9` and don't notice that `0.9` is not exactly representable. The error is invisible until summing many such calculations. + +**How to avoid:** +- Store and compute ALL financial values as `int64` representing the smallest currency unit (eurocents for EUR). €55 = 5500 cents. +- Never store or compute monetary values as `float64`. 
If you need percentage-based rake, multiply first then divide: `rake = (buyInCents * rakePercent) / 100` using integer division.
- For payout percentage splits (e.g., 35.7% to 1st place), compute each payout as `(prizePool * 357) / 1000` in integer arithmetic, then distribute any remainder (due to truncation) to the first payout position.
- Write a test that sums all individual payouts and asserts equality to the prize pool total — this test will catch floating-point drift immediately.

**Warning signs:**
- Prize pool "total" displayed in UI differs by €0.01-0.02 from sum of individual payouts
- Rake calculation produces fractional cents
- Any use of `float64` in the financial calculation code path

**Phase to address:** Phase 1 (financial engine) — this is a zero-compromise architectural constraint, not a later optimization.

---

## Moderate Pitfalls

### Pitfall 8: ARM64 CGO Cross-Compilation Blocking CI/CD

**What goes wrong:**
Go cross-compilation for `GOARCH=arm64` is trivial for pure Go code: set `GOOS=linux GOARCH=arm64` and the toolchain handles everything. The moment any dependency uses CGO (C bindings), this breaks. CGO requires a target-specific C toolchain (`aarch64-linux-gnu-gcc`). LibSQL's Go bindings may require CGO. If not caught early, the CI/CD pipeline that builds the Leaf node binary fails on x86-64 build agents, blocking all Leaf deployments.

**Prevention:**
- Audit all dependencies for CGO usage before finalizing the stack: `go list -deps -f '{{if .CgoFiles}}{{.ImportPath}}{{end}}' ./...` prints every dependency that contains cgo files.
- For any CGO dependency, set up Docker Buildx with `--platform linux/arm64` multi-arch builds from the start.
- Alternatively, choose pure-Go alternatives where possible: `modernc.org/sqlite` (CGO-free SQLite driver) vs `mattn/go-sqlite3` (requires CGO). Validate LibSQL Go driver CGO requirements against the latest release. 
+- Use a dedicated ARM64 build runner (e.g., a Hetzner CAX11 ARM instance) as the canonical Leaf build environment rather than cross-compiling. + +**Warning signs:** +- `cannot execute binary file: Exec format error` when deploying to Leaf +- CI build succeeds on x86 runner but produces wrong binary +- Build scripts not setting `GOARCH` and `GOOS` explicitly + +**Phase to address:** Phase 1 (build system setup) — establish cross-compilation pipeline before writing any Leaf-specific code. + +--- + +### Pitfall 9: WebSocket State Desync — Server Restarts vs. Client State + +**What goes wrong:** +The Leaf backend restarts (deploy, crash, OOM). All WebSocket connections drop. When clients reconnect, they receive a "current state" snapshot. But if the snapshot is emitted before all pending database writes have completed (race condition between reconnect handler and write completion), clients receive a stale snapshot and operate on incorrect state. Operators may see a tournament clock that's 45 seconds behind reality, or chip counts from before the last bust-out. + +**Why it happens:** +WebSocket reconnect handlers typically emit state immediately on connection establishment. If the server restart was triggered by a deploy, in-flight writes from the last moments before restart may not have been committed. The reconnect handler races against database recovery. + +**How to avoid:** +- Implement sequence numbers on all state updates. Every WebSocket message carries a monotonic `seq` field. Clients detect gaps in sequence and request a full resync rather than trusting a partial update. +- On server restart, wait for LibSQL WAL to be fully checkpointed before accepting new WebSocket connections (add a health check gate). +- Implement idempotent state application on the client: applying the same state update twice produces the same result (prevents double-application of duplicate messages during reconnect window). 
+- Only one goroutine writes to a WebSocket connection at a time (Go WebSocket constraint) — use a dedicated send goroutine with a buffered channel. + +**Warning signs:** +- Tournament clock jumps backward after reconnect +- Chip counts inconsistent between two operators' screens after server restart +- Client receiving sequence numbers with gaps + +**Phase to address:** Phase 1 (real-time sync architecture) — must be designed correctly from the first WebSocket implementation. + +--- + +### Pitfall 10: SBC Hardware Reliability — Power Loss During Tournament + +**What goes wrong:** +The Leaf node (Orange Pi 5 Plus) loses power mid-tournament. Even with NVMe (superior to SD card), an unclean shutdown can corrupt the filesystem if writes were in-flight. LibSQL's WAL may be partially written. The NATS JetStream store directory may be in an inconsistent state. On restoration, the system may fail to start, or worse, start with corrupted state that appears valid. + +**Why it happens:** +SBC deployments in venues are not enterprise environments. Power strips get kicked. UPS systems are absent. NVMe is more reliable than SD but is not immune to corruption on power loss — the OS ext4 journal and SQLite WAL both need clean shutdown to guarantee consistency. + +**How to avoid:** +- Mount the NVMe with a journaling filesystem configured for ordered data mode: `ext4` with `data=ordered` (default on most distros, but verify). +- Run LibSQL with `PRAGMA synchronous = FULL` (or at minimum `NORMAL`) and `PRAGMA journal_mode = WAL` — already planned, but verify synchronous mode specifically. +- Configure NATS `sync_interval: always` (Pitfall 1 above) to ensure JetStream state is on disk before any event is ACK'd. +- Implement a daily backup cron job that copies the LibSQL database to a USB drive or cloud (Hetzner Storage Box) — gives a recovery point even if local corruption is total. 
+- Add a systemd `ExecStop` hook that runs `PRAGMA wal_checkpoint(TRUNCATE)` before the process exits, minimizing WAL state at shutdown. + +**Warning signs:** +- Venues skipping UPS/surge protector hardware +- Leaf node failing to start after power cycle during testing +- Filesystem errors in `dmesg` on startup after unclean shutdown + +**Phase to address:** Phase 1 (Leaf node infrastructure setup) — backup strategy and sync configuration must be part of initial deployment scripts. + +--- + +### Pitfall 11: GDPR Violation — Storing Player PII on Leaf Without Consent Mechanism + +**What goes wrong:** +Player names, contact details, and tournament history are stored on the Leaf node for offline operation. If the venue is in the EU (the target market is Danish/European), this constitutes processing of personal data under GDPR. Without explicit consent capture at registration, documented data retention policies, and a right-to-erasure mechanism, the venue operator is liable. Fines can reach €20 million or 4% of global annual turnover. The platform architecture (players belong to Felt, not venues) amplifies risk — Felt is the data controller for the platform-level player profile. + +**Why it happens:** +Tournament management software traditionally treats player data as operational, not personal. Developers focus on functionality first. GDPR compliance is deferred as "legal's problem." The offline-first architecture compounds this: data on edge devices is harder to audit and harder to delete on-request. + +**How to avoid:** +- Define the data model from day one with GDPR in mind: separate PII fields (name, email, phone) from operational data (chip count, tournament position). This allows selective erasure without corrupting tournament history. +- Implement a "right to erasure" API endpoint that anonymizes PII (replaces name with "Player [ID]", nullifies contact fields) while preserving tournament result records for statistical purposes. 
+- Leaf node must be encrypted at rest (LUKS — already planned). Verify LUKS is set up in the provisioning flow. +- Data retention: document a default retention policy (e.g., player PII deleted after 12 months of inactivity) and implement automated enforcement. +- Consent must be captured before storing player contact details — the player registration flow must include explicit consent. + +**Warning signs:** +- Player registration form collecting email/phone without a consent checkbox +- No data deletion API endpoint in the player management module +- Player data stored without any anonymization strategy for inactive accounts + +**Phase to address:** Phase 1 (player management) for anonymization model; Phase 3 (platform maturity) for full GDPR compliance workflow. + +--- + +### Pitfall 12: Netbird Management Server as a Single Point of Failure + +**What goes wrong:** +The Leaf node connects to the self-hosted Netbird management server to establish WireGuard peers. If the Netbird management server is down, new peer connections cannot be established. In practice, once WireGuard peers are established, they maintain connectivity without the management server. But initial Leaf node boot, new display node enrollment, and player PWA access (via reverse proxy) all require the management server. A Netbird server outage at the start of a tournament is a critical incident. + +**Why it happens:** +The Netbird management server is treated as infrastructure rather than a critical dependency. It runs on a single LXC container on Proxmox. Single-container deployments have no redundancy. Developers assume "it's just networking" and don't plan for management plane failures. + +**How to avoid:** +- Run Netbird management server with Proxmox backup jobs (PBS daily backup) so restoration is fast if the container fails. 
+- Implement a startup procedure that verifies Netbird connectivity before marking the Leaf as ready for tournament use — surfaces infrastructure failures before they affect operations. +- Once WireGuard peers are established on the Leaf and display nodes, they retain connectivity through management server outages (WireGuard doesn't need a control plane after handshake). Document this so staff know not to panic if the management UI is unreachable mid-tournament. +- Consider running Netbird management on a separate, simpler VM rather than the same Proxmox host as other Core services — reduces correlated failure risk. + +**Warning signs:** +- Netbird management server on the same LXC as other Core services (single failure domain) +- No monitoring/alerting on management server health +- New Leaf provisioning untested after simulated management server outage + +**Phase to address:** Phase 1 (infrastructure setup) — define the Netbird deployment topology before provisioning hardware. + +--- + +### Pitfall 13: Player PWA Stale Service Worker Serving Old App Version + +**What goes wrong:** +SvelteKit PWA service workers cache the application for offline use. When the operator deploys a new version of the player PWA, players who have the old version cached via service worker continue seeing the old app. If a new API contract is introduced (e.g., a new field in the tournament state WebSocket message), the old client silently ignores or mishandles it. In the worst case, an old client submits a rebuy request using the old API shape, which the new server rejects, and the player receives no feedback. + +**Why it happens:** +Service worker update mechanics are subtle. The browser downloads the new service worker but doesn't activate it until all existing tabs running the old worker are closed. In a venue environment, players' phones keep the browser open throughout the tournament (for live clock viewing). The tab never closes, so the update never activates. 

**How to avoid:**
- Configure the service worker to use `skipWaiting()` and `clients.claim()` to force immediate activation of new service worker versions — accept the tradeoff that this can disrupt in-flight requests.
- Implement a version header in all API responses. The client checks the server version on every WebSocket connection and forces a full page reload if the versions diverge.
- Use the `@vite-pwa/sveltekit` plugin for zero-config PWA setup — it handles cache busting and update notifications correctly.
- During development, test service worker update behavior explicitly: deploy a version, open the app, deploy again, verify the client updates without manual browser restart.

**Warning signs:**
- Players reporting "the clock stopped updating" after a deploy
- API errors logged for requests with old field names/shapes after a schema change
- Service worker version in browser devtools differs from current deploy

**Phase to address:** Phase 1 (player PWA setup) — configure service worker update behavior before the PWA goes live. 
+ +--- + +## Technical Debt Patterns + +| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable | +|----------|-------------------|----------------|-----------------| +| Store prize pool as `float64` | Faster initial implementation | Cent-level rounding errors, reconciliation failures | Never | +| Skip WAL checkpoint configuration | Works fine in dev | WAL grows unbounded under tournament load, write stalls | Never | +| Copy NATS default config | Fast bootstrap | Data loss on power failure | Never for financial events | +| Hard-code venue ID (single-tenant) | Simplifies first version | Full schema migration to add multi-tenancy later | Only for pre-alpha validation | +| Use `SET` instead of `SET LOCAL` for RLS tenant | Slightly simpler code | Cross-tenant data leak in connection pool | Never | +| Skip Pi Zero 2W hardware testing | Faster UI iteration | Memory issues only discovered in production | Never — test on target hardware early | +| Auto-apply table rebalancing | Faster UX | Incorrect moves enforced without operator awareness | Never in a live tournament | +| Mock Netbird/WireGuard in dev | Faster development cycle | Networking issues only found at venue deployment | Acceptable in unit test phase; must integration-test before deploy | + +--- + +## Integration Gotchas + +| Integration | Common Mistake | Correct Approach | +|-------------|----------------|------------------| +| NATS embedded server | Copy quickstart config with default `sync_interval` | Set `sync_interval: always` in the embedded server options struct before starting | +| LibSQL cloud sync | Initiate sync inside an open write transaction | Gate sync behind a mutex; never overlap sync with write transaction | +| LibSQL + Go | Use `mattn/go-sqlite3` CGO driver | Evaluate `modernc.org/sqlite` (CGO-free) or LibSQL's own Go driver — verify CGO requirement for ARM64 cross-compile | +| PostgreSQL RLS | Use `SET app.tenant_id` (session-scoped) | Use `SET LOCAL app.tenant_id` inside every 
transaction |
| Netbird reverse proxy | Route all player PWA traffic through management server | Use Netbird's peer-to-peer WireGuard path; management server is only for control plane |
| SvelteKit service worker | Use default Workbox cache-first strategy for API calls | Use network-first for API responses, cache-first only for static assets |
| Chromium kiosk on Pi Zero 2W | No memory limits, default flags | Pass `--js-flags="--max-old-space-size=256"` to cap the V8 heap, enable `zram`, use systemd `MemoryMax` |
| Go WebSocket | Multiple goroutines writing to the same connection | Single dedicated send goroutine per connection; other goroutines push to a channel |

---

## Performance Traps

| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| Unbounded WAL growth | Reads slow down as WAL grows; checkpoint stalls block writes | Manual checkpoint scheduling, `journal_size_limit` pragma | > 50 concurrent writes without checkpoint, or after 2-hour tournament |
| Pi Zero 2W memory leak in display | Display goes blank mid-tournament | Explicit listener cleanup, memory profiling on target hardware | After 60-90 minutes of continuous display operation |
| NATS consumer storm on reconnect | All clients subscribe simultaneously after Leaf restart, overwhelming broker | Implement jittered reconnect backoff (50-500ms random) | > 20 concurrent display/player clients reconnecting simultaneously |
| Full RLS policy evaluation on every query | Slow queries as tournament grows | Index the `tenant_id` column; keep RLS policies simple (avoid JOINs in policy) | > 10,000 player records per venue |
| Broadcasting entire tournament state on every WebSocket message | Bandwidth spike on player PWA reconnect | Send deltas (only changed fields) after initial full-state sync | > 50 concurrent player PWA connections |

---

## Security Mistakes

| Mistake | Risk | Prevention |
|---------|------|------------|
| Storing player PII without encryption on 
Leaf | GDPR violation if device lost/stolen | LUKS full-disk encryption on NVMe (planned — verify it's in provisioning scripts) | +| Reusable Netbird enrollment key for all Leaf nodes | If key leaks, attacker can enroll rogue devices | Use one-time enrollment keys per Leaf node; rotate after provisioning | +| RLS bypass via direct database connection | Venue A reads venue B's data if DB credentials leak | Restrict DB user to application role only; no superuser credentials in app connection string | +| PIN authentication without rate limiting | Brute-force PIN in offline mode | Implement exponential backoff after 5 failed PIN attempts, lockout after 10 | +| Serving player PWA over HTTP (non-HTTPS) | Service workers require HTTPS; also exposes player data | All player-facing endpoints must terminate TLS (Netbird reverse proxy with Let's Encrypt) | + +--- + +## UX Pitfalls + +| Pitfall | User Impact | Better Approach | +|---------|-------------|-----------------| +| Auto-applying table rebalancing moves without confirmation | Operators don't know why players are being moved; incorrect moves go unchallenged | Always show proposed moves, require tap-to-confirm before executing | +| Tournament clock not visible on dark screens in a dim poker room | Operators squint, miss blind level changes | Dark-room-first design from day one: minimum 18pt font for clock, high contrast ratios > 7:1, Catppuccin Mocha base | +| Player PWA showing stale chip counts after reconnect | Players see incorrect stack sizes, distrust the platform | Show "last updated X seconds ago" indicator; force full resync on reconnect | +| Sound events playing at maximum volume | Dealers and players startled; venue disruption | Per-venue configurable volume, default to 50%, fade-in for alerts | +| Prize payout screen not showing running total vs. 
paid out total | Operator makes payout errors when managing multiple players simultaneously | Show real-time "remaining to pay out" counter on payout screen |

---

## "Looks Done But Isn't" Checklist

- [ ] **Tournament clock:** Verify pause/resume correctly adjusts all time-based triggers (blind level end, break start) — not just the display counter
- [ ] **Prize pool:** Verify sum of all individual payouts equals prize pool total (run automated reconciliation test)
- [ ] **Table rebalancing:** Verify algorithm handles all TDA edge cases: last 2-player table, dealer button seat, player in blind position
- [ ] **Offline mode:** Verify full tournament can run (including rebuys, bust-outs, level changes) with internet completely disconnected for 4+ hours
- [ ] **Display node restart:** Verify display node automatically rejoins and resumes correct view after reboot without operator intervention
- [ ] **NATS replay:** Verify all queued events replay correctly after Leaf comes back online after 8+ hours offline
- [ ] **RLS isolation:** Verify API endpoints return 0 results (not 403) for valid venue A token querying venue B data — 403 leaks resource existence
- [ ] **GDPR erasure:** Verify player PII deletion does not delete tournament result records (anonymize, don't delete)
- [ ] **NATS fsync:** Verify `sync_interval: always` is in the deployed Leaf configuration (not just development)
- [ ] **Pi Zero memory:** Verify display node shows no memory growth after 4-hour continuous tournament with `/usr/bin/free -m` monitoring

---

## Recovery Strategies

| Pitfall | Recovery Cost | Recovery Steps |
|---------|---------------|----------------|
| NATS data loss (wrong fsync default in production) | HIGH | Restore from last backup; replay any manually recorded tournament events; accept some data loss |
| SQLite corruption after power loss | MEDIUM | SQLite usually recovers on next open by discarding incomplete WAL frames; run `PRAGMA integrity_check` to verify, and restore from daily backup if it reports errors |
| 
Prize pool floating-point error discovered post-tournament | HIGH | Manual audit of all transactions; correction requires agreement of all players involved | +| RLS cross-tenant data leak | HIGH | Immediate incident response; audit logs for all affected queries; notify affected venues per GDPR breach requirements (72-hour deadline) | +| Pi Zero display failure mid-tournament | LOW | Display nodes are stateless — reboot restores operation within 60 seconds; have spare Pi Zero on-site | +| Table rebalancing error (player moved incorrectly) | MEDIUM | Manual seat correction via operator UI; document as a tournament irregularity in audit log | +| Service worker serving stale PWA | LOW | Force browser refresh (player gesture); if critical, add server-side cache-busting header | + +--- + +## Pitfall-to-Phase Mapping + +| Pitfall | Prevention Phase | Verification | +|---------|------------------|--------------| +| NATS default fsync data loss | Phase 1 (NATS setup) | Integration test: write 100 events, power-cycle Leaf, verify all 100 replay correctly | +| LibSQL WAL checkpoint stall | Phase 1 (database initialization) | Load test: 500 writes/min for 2 hours, monitor WAL file size stays bounded | +| Offline sync financial conflict | Phase 1 (financial engine architecture) | Conflict test: process rebuy offline, simulate late online event with same sequence, verify ledger correctness | +| Pi Zero memory exhaustion | Phase 1 (display node MVP) | Soak test: run display on actual Pi Zero 2W hardware for 4 hours, monitor RSS | +| Table rebalancing algorithm | Phase 1 (seating engine) | Unit tests covering all TDA edge cases (min 20 cases); load test with simulated 40-table tournament | +| Multi-tenant RLS data leak | Phase 1 (Core backend) | Security test: verify all 30+ API endpoints return correct tenant-scoped data only | +| Float arithmetic in financials | Phase 1 (financial engine) | Automated test: sum of payouts must equal prize pool (run as CI gate) | +| ARM64 CGO 
cross-compile | Phase 1 (build system) | CI gate: ARM64 binary builds successfully and passes smoke test on Orange Pi 5 Plus | +| WebSocket state desync | Phase 1 (real-time sync) | Chaos test: restart Leaf server mid-tournament, verify all clients resync within 5 seconds | +| SBC power loss data corruption | Phase 1 (infrastructure) | Chaos test: hard-power-cycle Leaf mid-tournament 10 times, verify restart always recovers cleanly | +| GDPR compliance | Phase 1 (player management) + Phase 3 | Verify: right-to-erasure API anonymizes PII, preserves results; audit trail shows all PII access | +| Netbird management SPOF | Phase 1 (infrastructure design) | Test: take Netbird management offline, verify existing WireGuard peers retain connectivity | +| PWA stale service worker | Phase 1 (PWA setup) | Test: deploy v1, open app, deploy v2, verify client shows v2 without manual browser restart | + +--- + +## Sources + +- [NATS JetStream Anti-Patterns for Scale — Synadia](https://www.synadia.com/blog/jetstream-design-patterns-for-scale) (MEDIUM confidence — official vendor blog) +- [Jepsen: NATS 2.12.1 — jepsen.io](https://jepsen.io/blog/2025-12-08-nats-2.12.1) (HIGH confidence — independent analysis, Dec 2025) +- [NATS JetStream loses acknowledged writes by default — GitHub Issue #7564](https://github.com/nats-io/nats-server/issues/7564) (HIGH confidence — official tracker) +- [Downsides of Local First / Offline First — RxDB](https://rxdb.info/downsides-of-offline-first.html) (MEDIUM confidence — library author perspective) +- [SQLite Write-Ahead Logging — sqlite.org](https://sqlite.org/wal.html) (HIGH confidence — official documentation) +- [LibSQL Embedded Replicas Data Corruption — GitHub Discussion #1910](https://github.com/tursodatabase/libsql/discussions/1910) (HIGH confidence — official tracker) +- [Turso Offline Sync Public Beta](https://turso.tech/blog/turso-offline-sync-public-beta) (MEDIUM confidence — vendor announcement) +- [PostgreSQL RLS Implementation Guide 
— permit.io](https://www.permit.io/blog/postgres-rls-implementation-guide) (MEDIUM confidence — verified against AWS prescriptive guidance) +- [Multi-tenant Data Isolation with PostgreSQL RLS — AWS](https://aws.amazon.com/blogs/database/multi-tenant-data-isolation-with-postgresql-row-level-security/) (HIGH confidence — official AWS documentation) +- [Floats Don't Work for Storing Cents — Modern Treasury](https://www.moderntreasury.com/journal/floats-dont-work-for-storing-cents) (HIGH confidence — multiple corroborating sources) +- [SQLite WAL Checkpoint Starvation — sqlite-users](https://sqlite-users.sqlite.narkive.com/muT0rMYt/sqlite-wal-checkpoint-starved) (MEDIUM confidence — community discussion) +- [Chromium on Pi Zero 2W memory constraints — Raspberry Pi Forums](https://forums.raspberrypi.com/viewtopic.php?t=326222) (MEDIUM confidence — community-verified) +- [NetBird 2025 Guide: 5 Critical Mistakes](https://junkangworld.com/blog/your-2025-netbird-guide-5-critical-mistakes-to-avoid) (LOW confidence — third-party blog, verify against official docs) +- [SvelteKit Service Workers — official docs](https://kit.svelte.dev/docs/service-workers) (HIGH confidence — official documentation) +- [The Tournament Director — known bugs changelog](https://thetournamentdirector.net/changes.txt) (HIGH confidence — official changelog) +- [Poker TDA Rules 2013 — table balancing procedures](https://www.pokertda.com/wp-content/uploads/2013/08/Poker_TDA_Rules_2013_Version_1.1_Final_handout_PDF_redlines_from_2011_Rules.pdf) (MEDIUM confidence — check against current TDA ruleset) +- [GDPR compliance for gaming/gambling operators — GDPR Local](https://gdprlocal.com/gdpr-compliance-online-casinos-betting-operators/) (MEDIUM confidence — legal advisory blog, not authoritative) + +--- +*Pitfalls research for: edge-cloud poker venue management platform (Felt)* +*Researched: 2026-02-28* diff --git a/.planning/research/STACK.md b/.planning/research/STACK.md new file mode 100644 index 
0000000..114761d --- /dev/null +++ b/.planning/research/STACK.md @@ -0,0 +1,278 @@ +# Stack Research + +**Domain:** Edge-cloud poker venue management platform (ARM64 SBC + cloud hybrid) +**Researched:** 2026-02-28 +**Confidence:** MEDIUM-HIGH (core stack verified via official sources; peripheral libraries verified via pkg.go.dev and GitHub releases; CGO cross-compilation complexity is a known risk requiring phase-specific validation) + +--- + +## Recommended Stack + +### Core Technologies + +| Technology | Version | Purpose | Why Recommended | +|------------|---------|---------|-----------------| +| Go | 1.26 | Backend runtime (Leaf + Core shared codebase) | Single binary deployment, ARM64 cross-compilation, goroutine concurrency for real-time tournament state, excellent stdlib HTTP. Released Feb 10, 2026. | +| SvelteKit | 2.53.x | Operator UI, player PWA, admin dashboard | Single codebase for SPA, SSR, and PWA modes. SvelteKit 2 + Svelte 5 runes are production-stable. Adapter-static for Go embed, adapter-node for standalone. | +| Svelte | 5.53.x | Frontend framework | Runes reactivity model ($state, $derived, $effect) handles high-frequency real-time data (100ms tournament clock) without store complexity. Svelte 5 stable since Oct 2024. | +| NATS Server | 2.12.4 | Embedded message broker on Leaf; clustered on Core | Embeds directly into Go binary (~10MB RAM overhead), JetStream provides offline-durable queuing, ordered replay on reconnect, KV store. ARM64 native packages available. | +| LibSQL (go-libsql) | unreleased / CGO | Embedded SQLite-compatible DB on Leaf | SQLite-compatible with built-in replication support. Supports linux/arm64 natively via precompiled binaries. CGO_ENABLED=1 required. | +| PostgreSQL | 16 | Relational DB on Core | Standard choice for Core; multi-tenant RLS, full-text search for player lookup, proven at scale. LibSQL mirrors for sync path. 
| +| Tailwind CSS | 4.x | UI styling | v4 uses Vite plugin (no PostCSS config needed), 100x faster incremental builds, CSS-native config. Pairs naturally with SvelteKit's Vite build pipeline. | +| Netbird | latest | WireGuard mesh overlay network | Self-hosted, provides mesh VPN + reverse proxy + DNS + SSH + firewall policies in one platform. Zero-config peer connection through NAT. ARM64 client supported. | +| Authentik | 2026.2.x | Self-hosted OIDC Identity Provider | Integrates natively with Netbird self-hosted. Provides SSO for operator login, LDAP fallback, Apache 2.0. Requires PostgreSQL + Redis; runs in LXC on Core. | + +### Supporting Libraries — Go Backend + +| Library | Version | Purpose | When to Use | +|---------|---------|---------|-------------| +| github.com/go-chi/chi/v5 | v5.2.5 | HTTP router | All Leaf and Core HTTP handlers. Lightweight, fully net/http compatible, composable middleware, no magic. | +| github.com/nats-io/nats.go | v1.49.0 | NATS client (JetStream API) | Publishing, consuming, and managing JetStream streams from application code. Uses the new jetstream sub-package API. | +| github.com/nats-io/nats-server/v2 | v2.12.4 | Embedded NATS server | Leaf embeds this directly via server.NewServer() + EnableJetStream(). Not used on Core (standalone server process). | +| github.com/pressly/goose/v3 | v3.27.0 | Database migrations | Runs schema migrations at startup via embed.FS. Supports SQLite + PostgreSQL with same migration files. | +| github.com/sqlc-dev/sqlc | v1.30.0 | Type-safe SQL code generation | Generate Go structs and query functions from raw SQL. Eliminates ORM overhead, keeps SQL as SQL. | +| github.com/coder/websocket | v1.8.14 | WebSocket server | Real-time push to operator UI and player PWA. Actively maintained successor to nhooyr/websocket. Context-aware, zero-allocation. | +| github.com/golang-jwt/jwt/v5 | latest | JWT token handling | Offline PIN-based auth on Leaf (no network dependency). 
Validates tokens from Authentik OIDC on Core. | +| go.opentelemetry.io/otel | 1.x | Observability | Structured tracing for state machine transitions, tournament operations. Add otelchi for per-request span creation. | +| github.com/riandyrn/otelchi | latest | OpenTelemetry middleware for chi | Automatic HTTP span creation. Plug into chi middleware chain. | + +### Supporting Libraries — SvelteKit Frontend + +| Library | Version | Purpose | When to Use | +|---------|---------|---------|-------------| +| @vite-pwa/sveltekit | latest | PWA + service worker | Player PWA offline caching, installable shell. Wraps Workbox. Required for offline player access. | +| vite-plugin-pwa | latest | PWA build tooling | Underlying PWA config for manifest, service worker generation. | +| @tailwindcss/vite | 4.x | Tailwind v4 Vite integration | Add before sveltekit() in vite.config. CSS-native config via app.css @import "tailwindcss". | +| svelte-sonner | latest | Toast notifications | Operator action feedback (seat assignments, registration, bust-outs). Lightweight, accessible. | +| @lucide/svelte | latest | Icon set | Consistent iconography. Tree-shakeable, Svelte-native bindings. | + +### Development Tools + +| Tool | Purpose | Notes | +|------|---------|-------| +| Task (Taskfile) | Build orchestration | Define `build:leaf`, `build:core`, `build:frontend`, `cross-compile:arm64` tasks. Replaces Makefile with YAML syntax. | +| Docker buildx | ARM64 cross-compilation | For CGO-enabled builds targeting linux/arm64. Use `--platform linux/arm64` with aarch64-linux-gnu-gcc cross-compiler in container. | +| air | Go live reload (dev) | `github.com/air-verse/air` — watches Go files, rebuilds on change. Dev only. | +| golangci-lint | Go linting | Runs multiple linters. Critical for enforcing error handling patterns in state machine code. | +| playwright | E2E testing | Test operator UI flows. Svelte-compatible. | +| sqlc | SQL → Go codegen | Run as part of build pipeline. 
Check generated files into git. | + +--- + +## Installation + +### Go Backend + +```bash +# Initialize module +go mod init felt + +# Core router and middleware +go get github.com/go-chi/chi/v5@v5.2.5 +go get github.com/go-chi/cors + +# NATS (client + embedded server) +go get github.com/nats-io/nats.go@v1.49.0 +go get github.com/nats-io/nats-server/v2@v2.12.4 + +# Database — LibSQL (CGO required) +# Note: CGO_ENABLED=1 and linux/arm64 cross-compiler required for Leaf builds +go get github.com/tursodatabase/go-libsql + +# Database — PostgreSQL for Core +go get github.com/jackc/pgx/v5 + +# Migrations and query gen +go get github.com/pressly/goose/v3@v3.27.0 +# sqlc: install as tool +go install github.com/sqlc-dev/sqlc/cmd/sqlc@v1.30.0 + +# Auth +go get github.com/golang-jwt/jwt/v5 + +# WebSocket +go get github.com/coder/websocket@v1.8.14 + +# Observability +go get go.opentelemetry.io/otel +go get go.opentelemetry.io/otel/sdk +go get github.com/riandyrn/otelchi +``` + +### SvelteKit Frontend + +```bash +# Scaffold +npx sv create felt-frontend +# Choose: SvelteKit, TypeScript, Tailwind + +# Or manually: +npm create svelte@latest felt-frontend + +# Tailwind v4 +npm install tailwindcss @tailwindcss/vite + +# PWA +npm install -D @vite-pwa/sveltekit vite-plugin-pwa + +# UI libraries +npm install svelte-sonner @lucide/svelte +``` + +--- + +## Alternatives Considered + +| Recommended | Alternative | When to Use Alternative | +|-------------|-------------|-------------------------| +| chi (router) | Gin, Echo, Fiber | Gin if you want more batteries-included middleware and faster onboarding. Fiber if raw throughput benchmarks matter more than stdlib compatibility. Chi chosen here because it's pure net/http, no magic, easy to embed with NATS server in same binary. | +| LibSQL (go-libsql) | mattn/go-sqlite3 | go-sqlite3 if you never need replication or remote sync. go-libsql is SQLite-compatible but adds replication capability needed for Leaf→Core sync path. 
| LibSQL (go-libsql) | modernc.org/sqlite | modernc if CGO is unacceptable (pure Go, no cross-compile issues). Tradeoff: no replication, pure-Go performance is slower, and you lose LibSQL's sync protocol. | +| goose | golang-migrate | golang-migrate is fine but goose has cleaner embed.FS support and the sqlc community uses it as the reference migration tool. | +| coder/websocket | gorilla/websocket | gorilla/websocket if you need RFC 6455 edge cases or have existing gorilla-dependent code. gorilla is widely used but coder/websocket is the modern, context-aware successor. | +| NATS JetStream | Redis Streams, RabbitMQ | Redis Streams if you already have Redis in the stack. RabbitMQ for complex enterprise routing. NATS chosen because it embeds in the Go binary (no separate process on Leaf), runs on 10MB RAM, and handles offline-queued replay natively. | +| Authentik | Keycloak, Authelia | Keycloak if you need SAML federation or very large enterprise deployments. Authelia if you want lightweight forward-auth only. Authentik chosen for OIDC depth, active development, and documented Netbird integration. | +| Tailwind CSS v4 | UnoCSS, vanilla CSS | UnoCSS if Tailwind v4's Rust engine still has edge-case gaps in your toolchain. Vanilla CSS with CSS custom properties if the design system is simple. Tailwind v4 chosen for its systematic utility classes matching the dense information design requirement. | +| SvelteKit | React/Next.js, Vue/Nuxt | React/Next.js if you have an existing team with React expertise. SvelteKit chosen for smaller bundle sizes (critical for Pi Zero display nodes loading over WiFi), built-in PWA path, and Svelte 5 runes handling high-frequency clock updates without virtual DOM overhead.
| + +--- + +## What NOT to Use + +| Avoid | Why | Use Instead | +|-------|-----|-------------| +| gorilla/websocket (new code) | Unmaintained since 2023, no context support, no concurrent writes | github.com/coder/websocket (maintained successor) | +| gorm | ORM magic hides SQL, bad for complex tournament state queries, generates inefficient queries, fights with LibSQL's CGO interface | sqlc for generated type-safe queries, raw database/sql when needed | +| modernc.org/sqlite for Leaf | Pure-Go SQLite has no replication — you lose the LibSQL sync protocol that enables Leaf→Core data replication | tursodatabase/go-libsql (CGO, linux/arm64 prebuilt) | +| React/Next.js | Heavy bundle — Pi Zero 2W (512MB RAM) running Chromium kiosk will struggle; Svelte compiles to vanilla JS with no runtime | SvelteKit + Svelte 5 | +| Svelte 4 / SvelteKit 1 | End of active development; Svelte 5 runes are the current API; v4 stores pattern has known SSR shared-state bugs | Svelte 5 + SvelteKit 2 | +| Tailwind CSS v3 | Requires PostCSS config, slower builds, JS-based config. 
v4 drops all of this and integrates cleaner with Vite | Tailwind CSS v4 with @tailwindcss/vite plugin | +| OIDC-only auth on Leaf | If Core/internet is down, OIDC token validation fails → operators locked out | JWT-based offline PIN auth on Leaf, OIDC only when online | +| Global Netbird cloud | Introduces Netbird as a third-party MITM for network control plane | Self-hosted Netbird management server on Core (Hetzner Proxmox LXC) | +| Docker on Leaf (ARM64) | Docker daemon adds ~100MB RAM overhead on a 4-8GB device; unnecessary abstraction for a dedicated appliance | Bare systemd services; Go single binary + SvelteKit embedded build | + +--- + +## Stack Patterns by Variant + +**For Leaf Node (ARM64 SBC, offline-first):** +- Go binary with embedded NATS server + JetStream store on NVMe +- LibSQL (go-libsql, CGO) with goose migrations at startup +- SvelteKit build embedded via `//go:embed all:build` in Go binary +- Single systemd service: `felt-leaf` +- Offline PIN auth via JWT; Authentik OIDC optional when online +- CGO cross-compile: `GOOS=linux GOARCH=arm64 CGO_ENABLED=1 CC=aarch64-linux-gnu-gcc go build` + +**For Core Node (Hetzner Proxmox LXC, always-online):** +- Go binary (no embedded NATS — connects to standalone NATS cluster) +- PostgreSQL (pgx/v5 driver) +- NATS JetStream cluster for multi-venue message routing +- Authentik + Netbird management server as separate LXC containers +- Standard AMD64 build: `GOOS=linux GOARCH=amd64 go build` + +**For Display Nodes (Pi Zero 2W, Chromium kiosk):** +- No custom software on the node itself +- Raspberry Pi OS Lite + X11 + Chromium in `--kiosk` mode +- Chromium points to `http://leaf.local/display/{node-id}` (served by Leaf Go binary) +- Display content is pure SvelteKit SPA pages served from the Leaf +- Node management (which view to show) handled via NATS KV on Leaf + +**If CGO cross-compilation proves painful in CI:** +- Use Docker buildx with `FROM --platform=linux/arm64` base and native arm64 runner +- OR accept 
CGO complexity and use `aarch64-linux-gnu-gcc` in a standard amd64 CI runner +- Do NOT switch to modernc.org/sqlite — losing replication is worse than cross-compile friction + +--- + +## Version Compatibility + +| Package | Compatible With | Notes | +|---------|-----------------|-------| +| go-libsql (unreleased) | Go 1.26, linux/arm64 | No tagged releases; pin to commit hash in go.mod. CGO_ENABLED=1 mandatory. | +| @vite-pwa/sveltekit | SvelteKit 2.x, Vite 5.x | From v0.3.0+ supports SvelteKit 2. | +| goose v3.27.0 | Go 1.25+ | Requires Go 1.25 minimum per v3.27.0 release notes. Go 1.26 is compatible. | +| chi v5.2.5 | Go 1.22+ | Standard net/http; any Go 1.22+ supported. | +| Tailwind v4 | Vite 5.x, SvelteKit 2.x | @tailwindcss/vite must be listed before sveltekit() in vite.config plugins array. | +| NATS server v2.12.4 | nats.go v1.49.0 | Use matching client and server versions. Server v2.12.x is compatible with client v1.49.x. | +| Svelte 5.53.x | SvelteKit 2.53.x | Must use matching Svelte 5 + SvelteKit 2. Svelte 4 / SvelteKit 1 are not compatible targets. | + +--- + +## Critical Build Notes + +### LibSQL CGO Cross-Compilation + +go-libsql has no tagged releases on GitHub. You must pin to a specific commit: + +```bash +go get github.com/tursodatabase/go-libsql@<commit-hash> +``` + +Cross-compilation for ARM64 requires the GNU cross-compiler toolchain: + +```bash +# On Ubuntu/Debian CI +apt-get install gcc-aarch64-linux-gnu + +# Build command +GOOS=linux GOARCH=arm64 CGO_ENABLED=1 \ + CC=aarch64-linux-gnu-gcc \ + go build -o felt-leaf ./cmd/leaf +``` + +Alternative (recommended for CI consistency): Use Docker buildx with an ARM64 base image to build natively, avoiding cross-compiler dependency management.
+ +### SvelteKit Static Embed in Go Binary + +```go +//go:embed all:frontend/build +var frontendFS embed.FS + +// Serve the embedded SvelteKit build at the root: +sub, err := fs.Sub(frontendFS, "frontend/build") +if err != nil { +	log.Fatal(err) +} +http.Handle("/", http.FileServer(http.FS(sub))) +``` + +SvelteKit must be built with `@sveltejs/adapter-static` for full embed. The Leaf serves all frontend assets from its single binary — no separate static file server. + +### NATS Embedded Server Setup + +```go +opts := &server.Options{ +	Port:      4222, +	Host:      "127.0.0.1", +	JetStream: true, +	StoreDir:  "/data/nats/jetstream", // on NVMe +} +ns, err := server.NewServer(opts) +if err != nil { +	log.Fatal(err) +} +ns.Start() +if !ns.ReadyForConnections(5 * time.Second) { +	log.Fatal("embedded NATS server did not start") +} +nc, _ := nats.Connect(ns.ClientURL()) +js, _ := jetstream.New(nc) +``` + +--- + +## Sources + +- Go 1.26 release: https://go.dev/blog/go1.26 — HIGH confidence +- NATS Server v2.12.4: https://github.com/nats-io/nats-server/releases — HIGH confidence (official) +- NATS server embed API: https://pkg.go.dev/github.com/nats-io/nats-server/v2/server — HIGH confidence +- nats.go v1.49.0: https://github.com/nats-io/nats.go/releases — HIGH confidence +- go-libsql ARM64 support: https://github.com/tursodatabase/go-libsql — HIGH confidence (official repo) +- chi v5.2.5: https://github.com/go-chi/chi/tree/v5.2.3 + pkg.go.dev — HIGH confidence +- goose v3.27.0: https://pkg.go.dev/github.com/pressly/goose/v3 + releases — HIGH confidence +- sqlc v1.30.0: https://github.com/sqlc-dev/sqlc/releases — HIGH confidence +- coder/websocket v1.8.14: https://github.com/coder/websocket — HIGH confidence +- SvelteKit 2.53.2: https://www.npmjs.com/package/@sveltejs/kit — HIGH confidence +- Svelte 5.53.5: https://www.npmjs.com/package/svelte — HIGH confidence +- Tailwind v4 Vite integration: https://tailwindcss.com/docs/guides/sveltekit — HIGH confidence +- @vite-pwa/sveltekit: https://github.com/vite-pwa/sveltekit — MEDIUM confidence (version not pinned) +- Authentik Netbird integration: https://docs.netbird.io/selfhosted/identity-providers/authentik — MEDIUM confidence +- CGO ARM64
cross-compilation: https://forum.golangbridge.org/t/cross-compiling-go-with-cgo-for-arm64/38794 — MEDIUM confidence (community) +- Go embed + SvelteKit: https://www.liip.ch/en/blog/embed-sveltekit-into-a-go-binary — MEDIUM confidence (verified pattern, widely cited) +- Pi Zero 2W Chromium kiosk: https://gist.github.com/lellky/673d84260dfa26fa9b57287e0f67d09e — MEDIUM confidence + +--- + +*Stack research for: Felt — Edge-cloud poker venue management platform* +*Researched: 2026-02-28* diff --git a/.planning/research/SUMMARY.md b/.planning/research/SUMMARY.md new file mode 100644 index 0000000..b4357f1 --- /dev/null +++ b/.planning/research/SUMMARY.md @@ -0,0 +1,263 @@ +# Project Research Summary + +**Project:** Felt — Edge-cloud poker venue management platform +**Domain:** Live venue poker operations (ARM64 SBC + cloud hybrid, offline-first) +**Researched:** 2026-02-28 +**Confidence:** MEDIUM-HIGH + +## Executive Summary + +Felt is a three-tier edge-cloud platform for managing live poker venues: tournament operations, cash game management, player tracking, and digital display signage. The competitive landscape (TDD, Blind Valet, BravoPokerLive, LetsPoker) reveals a clear gap — no single product combines offline-first reliability, wireless display management, cloud sync, and modern UX. Experts in this domain build tournament state machines as pure functions with an append-only event log backing financial calculations, and treat offline operation as a first-class architectural constraint rather than a fallback. The recommended approach is a Go monorepo with shared domain logic compiled to two targets: an ARM64 Leaf binary for venue hardware and an amd64 Core binary for the cloud tier, connected via NATS JetStream leaf-node mirroring over a WireGuard mesh. + +The primary technical risk is the intersection of offline-first requirements with financial correctness. 
Financial mutations (buy-ins, rebuys, prize pool splits) must be modelled as an immutable append-only event log using integer arithmetic — not mutable rows or floating-point values. Any deviation from this is not recoverable post-production without manual audit and player agreement. The second major risk is CGO cross-compilation complexity introduced by the LibSQL Go driver; this must be validated in CI from day one. A third risk is NATS JetStream's default `sync_interval` which does not guarantee durability on power loss — requiring an explicit configuration override before any production deployment. + +The architecture is well-validated: Go's single-binary embed model (SvelteKit built assets embedded via `go:embed`) eliminates deployment complexity on ARM hardware; NATS JetStream's leaf-node domain isolation provides clean offline queuing with replay; PostgreSQL RLS provides multi-tenant isolation on Core; Pi Zero 2W display nodes are stateless Chromium kiosk consumers, not managed agents. The main uncertainty is around LibSQL's go-libsql driver (no tagged releases, pinned to commit hash) and the Netbird reverse proxy beta status, both of which require early integration testing to validate before committing to downstream features. + +## Key Findings + +### Recommended Stack + +The stack is a Go + SvelteKit monorepo targeting two runtimes. Go 1.26 provides single-binary deployment, native ARM64 cross-compilation, and goroutine-based concurrency for real-time tournament clock management. SvelteKit 2 + Svelte 5 runes handle all frontends (operator UI, player PWA, display views, public pages) with Svelte 5's runes reactivity model handling 100ms clock updates without virtual DOM overhead. NATS Server 2.12.4 is embedded in the Leaf binary (~10MB RAM) and runs as a standalone cluster on Core; JetStream provides durable event queuing with offline store-and-forward replay. 
LibSQL (go-libsql, CGO-required) is the embedded database on Leaf; PostgreSQL 16 with row-level security is the multi-tenant store on Core. Netbird + Authentik provide self-hosted WireGuard mesh networking and OIDC identity. + +**Core technologies:** +- Go 1.26: Backend runtime for both Leaf and Core — single binary, ARM64 native, goroutine concurrency for real-time state +- SvelteKit 2 + Svelte 5: All frontends — operator UI, player PWA, display views, public pages served from embedded Go binary +- NATS JetStream 2.12.4: Embedded message broker on Leaf + hub cluster on Core — durable offline-first event sync with store-and-forward replay +- LibSQL (go-libsql): Embedded SQLite-compatible DB on Leaf — offline-first authoritative state store with WAL mode +- PostgreSQL 16: Multi-tenant relational store on Core — RLS tenant isolation, cross-venue aggregation, league management +- Netbird (WireGuard mesh): Zero-config peer networking — reverse proxy for player PWA external access, encrypted tunnel for Core sync +- Authentik: Self-hosted OIDC identity provider — operator SSO with offline PIN fallback, integrates with Netbird + +**Critical version notes:** +- go-libsql has no tagged releases; pin to commit hash in go.mod +- NATS server v2.12.4 must match nats.go v1.49.0 client +- Tailwind CSS v4 requires `@tailwindcss/vite` plugin listed before `sveltekit()` in vite.config +- CGO_ENABLED=1 required for LibSQL; ARM64 cross-compilation needs `aarch64-linux-gnu-gcc` + +### Expected Features + +The MVP must replace TDD (The Tournament Director) and a whiteboard for a single-venue operator running a complete tournament. Competitive analysis confirms no product combines offline reliability with wireless displays and cloud sync — this is the primary differentiation window. 
+ +**Must have (table stakes) — P1:** +- Tournament clock engine (countdown, levels, breaks, pause/resume) — the core product job +- Configurable blind structures with presets, antes, chip-up messaging, and break config +- Player registration, bust-out tracking, rebuy/add-on/late registration handling +- Prize pool calculation and payout structure (ICM, fixed %, custom splits) +- Table seating assignment (random + manual) and automated table balancing +- Display system: clock view and seating view on dedicated screens (wireless) +- Player mobile PWA: live clock, blinds, personal rank — QR code access, no install +- Offline-first operation — zero internet dependency during tournament +- Role-based auth: operator PIN offline, OIDC online +- TDD data import: blind structures and player database +- Data export: CSV, JSON, HTML print output + +**Should have (competitive advantage) — P2:** +- Cash game: waitlist management, table status board, session/rake tracking, must-move logic +- Events engine: rule-based automation ("level advance → play sound, change display view") +- League/season management with configurable point formulas +- Hand-for-hand bubble mode per TDA rules +- Dealer tablet module for bust-outs and rebuys at the table +- Seat-available notifications (push/SMS) +- Digital signage content system with playlist scheduling + +**Defer (v2+) — P3:** +- WYSIWYG content editor and AI promo generation +- Dealer management and staff scheduling +- Player loyalty/points system +- Public venue presence page with online event registration +- Analytics dashboards (revenue, retention, game popularity) +- Native iOS/Android apps — PWA covers use case until then +- Cross-venue player leaderboards — requires network effect + +**Anti-features (do not build):** +- Payment processing or chip cashout — PCI/gambling license complexity +- Online poker gameplay — different product entirely +- BYO hardware support — undermines offline guarantees, support burden +- Real-money 
gambling regulation features — jurisdiction-specific legal maintenance + +### Architecture Approach + +The architecture is a three-tier model: Cloud Core (Hetzner Proxmox LXC, amd64), Edge Leaf (Orange Pi 5 Plus ARM64 SBC, NVMe), and Display Tier (Raspberry Pi Zero 2W Chromium kiosk). The Leaf node is the authoritative operational unit — all tournament writes happen to LibSQL first, domain events are published to a local NATS JetStream stream, and the Hub broadcasts state deltas to all WebSocket clients within ~10ms. NATS mirrors the local stream to Core asynchronously over WireGuard, providing offline store-and-forward sync with per-subject ordering guarantees. Core is an analytics/aggregation/cross-venue target only — never a write-path dependency. + +**Major components:** +1. Go Leaf API — tournament engine (pure domain logic, no I/O), financial engine, seating engine, WebSocket hub, REST/WS API; single ARM64 binary with embedded NATS + LibSQL + SvelteKit assets +2. Go Core API — multi-tenant aggregation, cross-venue leagues, player platform identity; PostgreSQL with RLS, NATS hub cluster +3. NATS JetStream (Leaf → Core) — leaf-node domain isolation, store-and-forward mirroring, append-only event audit log; doubles as sync mechanism and audit trail +4. WebSocket Hub — per-client goroutine send channels, central broadcast channel; non-blocking drop for slow clients; in-process pub/sub triggers +5. Display Tier — stateless Pi Zero 2W kiosk nodes; view assignment stored on Leaf (not on node); Chromium kiosk subscribes to assigned view URL via WebSocket; operator reassigns view through Hub broadcast +6. 
Netbird Mesh — WireGuard peer-to-peer tunnels; reverse proxy for player PWA HTTPS access; Authentik OIDC for operator auth when online + +**Key architecture rules:** +- Leaf is always the authoritative write target; Core is read/aggregation only +- Financial events are immutable append-only log; prize pool is derived, never stored +- All monetary values stored as int64 cents — never float64 +- Display nodes are stateless; view assignment is Leaf state, not node state +- Single Go monorepo with shared `internal/` packages; `cmd/leaf` and `cmd/core` are the only divergence points + +### Critical Pitfalls + +1. **NATS JetStream default fsync causes data loss on power failure** — Set `sync_interval: always` in the embedded NATS server config before first production deploy. The December 2025 Jepsen analysis confirmed NATS 2.12.1 loses acknowledged writes with default settings. Tournament events are low-frequency so the throughput tradeoff is irrelevant. + +2. **Float64 arithmetic corrupts prize pool calculations** — Store and compute all monetary values as `int64` cents. Percentage payouts use integer multiplication-then-division. Write a CI gate test: sum of individual payouts must equal prize pool total. This is a zero-compromise constraint — floating-point errors in production require manual audit and player agreement to resolve. + +3. **LibSQL WAL checkpoint stall causes clock drift** — Disable autocheckpoint (`PRAGMA wal_autocheckpoint = 0`), schedule explicit `PRAGMA wal_checkpoint(TRUNCATE)` during level transitions and breaks. Set `journal_size_limit` to 64MB. Gate LibSQL cloud sync behind a mutex with the write path — never overlap sync with an active write transaction. + +4. **Pi Zero 2W memory exhaustion crashes display mid-tournament** — Test ALL display views on actual Pi Zero 2W hardware from day one, not a Pi 4. Enable zram (~200MB effective headroom). Cap the JS heap with Chromium's `--js-flags=--max-old-space-size=256` (the bare V8 flag is a Node.js convention and is ignored by Chromium). Use systemd `MemoryMax=450M` with `Restart=always`.
Consider Server-Sent Events instead of WebSocket for display-only views to reduce connection overhead. + +5. **Table rebalancing algorithm produces invalid seating** — Implement rebalancing as a pure function returning a proposed move list, never auto-applied. Consult Poker TDA Rules 25-28. Unit test exhaustively: last 2-player table, dealer button position, player in blind position. Require operator tap-to-confirm before executing any move. Never silently apply balance suggestions. + +6. **PostgreSQL RLS tenant isolation bleeds between connections** — Always set the tenant id transaction-scoped via `SELECT set_config('app.tenant_id', $1, true)` (plain `SET LOCAL` cannot take bind parameters), never session-scoped `SET`. Assert `current_setting('app.tenant_id')` matches the JWT claim before every query. Write integration tests verifying venue A token returns 0 results (not 403) for venue B endpoints. + +7. **Offline sync conflict corrupts financial ledger** — Financial events must be immutable with monotonic sequence numbers assigned by the Leaf. Never use wall-clock timestamps for event ordering (SBC clock drift is real). Surface genuine conflicts in the operator UI rather than auto-merging. + +## Implications for Roadmap + +Research strongly supports the five-phase build order identified in ARCHITECTURE.md, with one critical addition: financial engine correctness and infrastructure hardening must be established before any frontend work begins. + +### Phase 1: Foundation — Leaf Core Infrastructure + +**Rationale:** The Leaf node is the architectural foundation everything else depends on. Tournament engine, financial engine, and data layer correctness must be established in isolation before adding network layers or frontends. All seven Phase 1 critical pitfalls manifest here: NATS fsync, float arithmetic, WAL configuration, Pi Zero 2W memory, seating algorithm, RLS isolation, and offline sync conflict handling. None of these are retrofittable. + +**Delivers:** A working offline tournament system accessible via API.
Operators can run a complete tournament (registration → clock → rebuys → bust-outs → payout) without any frontend UI, verifiable via API calls and automated tests. + +**Addresses (from FEATURES.md P1):** Tournament clock engine, blind structure config, player registration + bust-out, rebuy/add-on, prize pool calculation, table seating + balancing, financial engine, role-based auth, offline operation + +**Avoids:** NATS default fsync data loss, float arithmetic in financials, WAL checkpoint stall, offline sync financial conflict, table rebalancing invalid seating, Multi-tenant RLS data leak + +**Needs deeper research:** CGO cross-compilation pipeline (LibSQL ARM64 build); NATS JetStream embedded server wiring with domain isolation; go-libsql commit-pin strategy given no tagged releases + +### Phase 2: Operator + Display Frontend + +**Rationale:** The API from Phase 1 is the source of truth; frontend is a view layer. Building frontend after backend eliminates the common mistake of letting UI design drive data model decisions. Display node architecture (stateless Chromium kiosk) must be validated on actual Pi Zero 2W hardware before building more display views. + +**Delivers:** Fully operational venue management UI — operators can run a tournament through the SvelteKit operator interface; display nodes show clock/seating on TV screens; player PWA shows live data via QR code. 
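The Phase 1 financial-engine constraint called out under Critical Pitfalls (integer cents, payouts summing exactly to the pool) is testable without any UI. A stdlib-only sketch of the invariant, with illustrative names rather than the actual engine API:

```go
package main

import "fmt"

// SplitPrizePool divides a pool held as int64 cents across payout
// shares in basis points (5000 = 50%). Integer multiply-then-divide
// only; the rounding remainder goes to first place, so the payouts
// always sum exactly to the pool. Illustrative sketch.
func SplitPrizePool(poolCents int64, shareBps []int64) []int64 {
	payouts := make([]int64, len(shareBps))
	var distributed int64
	for i, bp := range shareBps {
		payouts[i] = poolCents * bp / 10000 // truncating integer division
		distributed += payouts[i]
	}
	if len(payouts) > 0 {
		payouts[0] += poolCents - distributed // remainder to 1st place
	}
	return payouts
}

func main() {
	// A 1000.01 pool split 50/30/20: no exact cent split exists,
	// so the leftover cent lands on first place.
	fmt.Println(SplitPrizePool(100001, []int64{5000, 3000, 2000})) // [50001 30000 20000]
}
```

The CI gate test from the pitfalls section is then one assertion: the sum of the returned slice equals the input pool, for every structure.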
+ +**Addresses (from FEATURES.md P1):** Display system (clock + seating views), player mobile PWA, TDD data import, data export + +**Implements (from ARCHITECTURE.md):** SvelteKit operator UI, display view routing via URL parameters + WebSocket view-change broadcast, player PWA with service worker, Netbird reverse proxy for external player access + +**Avoids:** Pi Zero 2W memory exhaustion (validate on hardware before adding more views), PWA stale service worker (configure skipWaiting from day one), WebSocket state desync on server restart + +**Standard patterns:** SvelteKit + Svelte 5 runes, Tailwind v4 Vite integration, vite-pwa/sveltekit plugin — well-documented, skip deep research here + +### Phase 3: Cloud Sync + Core Backend + +**Rationale:** Core is a progressive enhancement — Leaf operates completely without it. This deliberate ordering ensures offline-first is proven before adding the cloud dependency. Multi-tenant RLS and NATS hub cluster configuration are complex enough to warrant dedicated implementation phase after Leaf is battle-tested. + +**Delivers:** Leaf events mirror to Core PostgreSQL; multi-venue operator dashboard; player platform identity (player belongs to Felt, not just one venue); cross-venue league standings computable from aggregated data. 
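The Leaf-to-Core replay this phase relies on orders events by Leaf-assigned monotonic sequence numbers (Critical Pitfall 7), never timestamps. A stdlib-only sketch of the append-only log and its replay primitive, with illustrative names:

```go
package main

import (
	"fmt"
	"sync"
)

// Event ordering comes from a Leaf-assigned monotonic sequence number,
// not wall-clock time. Field and type names are illustrative.
type Event struct {
	Seq     uint64
	Kind    string
	Payload string
}

// EventLog is append-only: events are never updated or deleted.
type EventLog struct {
	mu      sync.Mutex
	lastSeq uint64
	events  []Event
}

func (l *EventLog) Append(kind, payload string) Event {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.lastSeq++ // monotonic, assigned under the lock
	e := Event{Seq: l.lastSeq, Kind: kind, Payload: payload}
	l.events = append(l.events, e)
	return e
}

// Since is the replay primitive: everything after seq, in order.
// Core would call the equivalent to catch up after an offline period.
func (l *EventLog) Since(seq uint64) []Event {
	l.mu.Lock()
	defer l.mu.Unlock()
	var out []Event
	for _, e := range l.events {
		if e.Seq > seq {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	var el EventLog
	el.Append("buyin", "player=7 amount_cents=5000")
	el.Append("rebuy", "player=7 amount_cents=5000")
	fmt.Println(len(el.Since(1))) // 1
}
```

In the real system JetStream supplies the durable log and replay; this only illustrates the sequencing contract the sync path depends on.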
+ +**Addresses (from FEATURES.md P1/P2):** TDD import to cloud, league/season management foundations, analytics data pipeline, multi-tenant operator accounts + +**Implements (from ARCHITECTURE.md):** PostgreSQL schema + RLS, NATS hub cluster, Leaf-to-Core JetStream mirror stream, Go Core API, SvelteKit SSR public pages + admin dashboard + +**Avoids:** Multi-tenant RLS tenant isolation bleed, Netbird management server as single point of failure (design redundancy here), NATS data loss on Leaf reconnect (verify replay correctness) + +**Needs deeper research:** NATS JetStream stream source/mirror configuration across domains; PostgreSQL RLS with pgx connection pool (transaction mode vs session mode); Authentik OIDC integration with Netbird self-hosted + +### Phase 4: Cash Game + Advanced Tournament Features + +**Rationale:** Cash game operations have different state machine characteristics than tournaments (open-ended sessions, waitlist progression, must-move table logic). Building after tournament proves the event-sourcing and WebSocket broadcast patterns. Events engine automation layer is additive on top of existing state machines. + +**Delivers:** Full cash game venue management: waitlist, table status board, session/rake tracking, must-move logic, seat-available notifications. Events engine enables rule-based automation for both tournament and cash game operations. 
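The waitlist and seat-available flow reduces to a FIFO queue plus a notification hook. A deliberately naive stdlib-only sketch (the TDA must-move priority rules are out of scope and flagged for deeper research; names are illustrative):

```go
package main

import "fmt"

// Waitlist is a plain FIFO queue of player IDs per game type.
// Illustrative only; must-move priority is intentionally omitted.
type Waitlist struct {
	queue []string
}

// Join appends a player and returns their 1-based position.
func (w *Waitlist) Join(player string) int {
	w.queue = append(w.queue, player)
	return len(w.queue)
}

// SeatAvailable pops the next player in arrival order. This is where
// a notification hook (PWA push vs SMS, still an open question for
// this phase) would fire.
func (w *Waitlist) SeatAvailable() (string, bool) {
	if len(w.queue) == 0 {
		return "", false
	}
	next := w.queue[0]
	w.queue = w.queue[1:]
	return next, true
}

func main() {
	var w Waitlist
	w.Join("alice")
	w.Join("bob")
	next, _ := w.SeatAvailable()
	fmt.Println(next) // alice
}
```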
+ +**Addresses (from FEATURES.md P2):** Waitlist management, table status board, session tracking, rake tracking, seat-available notifications, must-move table logic, events engine, hand-for-hand mode, dealer tablet module, digital signage content system + +**Avoids:** Must-move algorithm correctness (similar testing discipline as table rebalancing), GDPR consent capture for player contact details used in push notifications + +**Needs deeper research:** Push notification delivery for seat-available (PWA push vs SMS gateway); must-move table priority queue algorithm per TDA rules; digital signage content scheduling architecture + +### Phase 5: Platform Maturity + Analytics + +**Rationale:** Platform-level features (public venue pages, cross-venue leaderboards, loyalty system, analytics dashboards) require the player identity foundation from Phase 3 and the full event history from Phases 1-4. Analytics consumers on Core event streams can be added without modifying existing Leaf or Core operational code. + +**Delivers:** Public venue discovery pages, online event registration, player loyalty/points system, revenue analytics dashboards, full GDPR compliance workflow including right-to-erasure API. 
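Right-to-erasure without destroying tournament results comes down to anonymizing PII in place while preserving the stable player ID that result rows reference. A stdlib-only sketch with illustrative names:

```go
package main

import "fmt"

// Player as stored on Core. Result rows reference ID, so erasure must
// strip PII in place rather than delete the row. Names illustrative.
type Player struct {
	ID    int64
	Name  string
	Email string
	Phone string
}

// Anonymize keeps the stable ID (finish positions and ledger entries
// stay intact) and zeroes everything personally identifiable.
func Anonymize(p Player) Player {
	return Player{
		ID:   p.ID,
		Name: fmt.Sprintf("Deleted Player #%d", p.ID),
		// Email and Phone are left as zero values: erased.
	}
}

func main() {
	p := Anonymize(Player{ID: 42, Name: "Jane Doe", Email: "jane@example.com", Phone: "+4512345678"})
	fmt.Println(p.Name, p.Email == "", p.Phone == "") // Deleted Player #42 true true
}
```

The same shape works at the SQL layer as an UPDATE that nulls PII columns; the point is that the data model must plan for it from Phase 1, as noted above.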
+ +**Addresses (from FEATURES.md P3):** WYSIWYG content editor, AI promo generation, dealer management, loyalty points, public venue presence, analytics dashboards, full GDPR compliance + +**Avoids:** Full GDPR right-to-erasure implementation (PII anonymization without destroying tournament results), cross-tenant leaderboard data isolation + +**Standard patterns:** Analytics dashboards on time-series data — well-documented patterns; skip deep research unless using TimescaleDB/ClickHouse + +### Phase Ordering Rationale + +- Financial engine correctness, NATS durability, and data layer configuration are all Phase 1 because they are architectural constraints that cannot be retrofitted without full data migration or manual audit +- Frontend follows backend (Phase 2 after Phase 1) to prevent UI from driving data model decisions; the API contract is established before the first pixel is rendered +- Core cloud sync (Phase 3) is explicitly deferred until Leaf is proven in offline operation — this validates the most important product constraint before adding complexity +- Cash game (Phase 4) shares infrastructure with tournaments but has distinct operational semantics; building after tournament prevents premature abstraction +- GDPR compliance is split: the anonymization data model must be in place from Phase 1 (player management), but the full workflow (consent capture, deletion API, retention enforcement) is Phase 5 + +### Research Flags + +Phases likely needing `/gsd:research-phase` during planning: + +- **Phase 1 — CGO cross-compilation pipeline:** go-libsql has no tagged releases and CGO ARM64 cross-compilation is a known complexity point. Need to validate the specific commit hash strategy and Docker buildx vs cross-compiler approach before committing to the build pipeline. 
+- **Phase 1 — NATS embedded leaf node domain setup:** The exact configuration for running an embedded NATS server as a JetStream leaf node with a named domain, connecting to a Core hub, is documented but has known gotchas (domain naming conflicts, account configuration). Validate with a minimal integration test before building any domain event logic on top. +- **Phase 3 — NATS JetStream stream source/mirror across domains:** Cross-domain JetStream mirroring has specific configuration requirements. The Core side creating a stream source from a leaf domain is not well-documented outside official NATS docs. Needs validation test. +- **Phase 3 — Netbird reverse proxy beta status:** Netbird reverse proxy is in beta as of research date. The integration with Traefik as external reverse proxy needs explicit validation. Test before committing the player PWA access pattern to this mechanism. +- **Phase 4 — Push notification delivery for seat-available:** PWA push requires browser permission grant and a push service (Web Push Protocol, VAPID keys). SMS requires a gateway (Twilio, Vonage). Neither is trivial and the choice has cost and compliance implications. + +Phases with standard patterns (skip research-phase): + +- **Phase 2 — SvelteKit + Tailwind v4 + vite-pwa:** All well-documented with official guides. The integration patterns are verified and the stack is stable. Implement directly. +- **Phase 2 — WebSocket Hub pattern:** Canonical Go pattern with multiple reference implementations. Implement directly from the hub pattern documented in ARCHITECTURE.md. +- **Phase 5 — Analytics dashboards:** Standard time-series query patterns on PostgreSQL/TimescaleDB. Skip research unless introducing a dedicated analytics database. + +## Confidence Assessment + +| Area | Confidence | Notes | +|------|------------|-------| +| Stack | MEDIUM-HIGH | Core technologies (Go, SvelteKit, NATS, PostgreSQL) verified against official sources and current releases. 
go-libsql is the uncertainty — no tagged releases, CGO complexity. Netbird reverse proxy is beta. | +| Features | MEDIUM | Competitive analysis covered major products (TDD, Blind Valet, BravoPokerLive, LetsPoker, CasinoWare). Feature importance inferred from analysis and forum discussions, not direct operator interviews. Prioritization reflects reasonable inference, not validated PMF. | +| Architecture | MEDIUM-HIGH | NATS leaf-node patterns verified via official docs. WebSocket Hub is a canonical Go pattern. LibSQL embedded replication model verified. Pi Zero 2W constraints community-verified. Chromium kiosk approach has multiple real-world references. | +| Pitfalls | HIGH | NATS fsync data loss is documented in a December 2025 Jepsen analysis (independent, high confidence). Float arithmetic, RLS bleed, and WAL checkpoint issues are verified against official sources. Pi Zero 2W memory constraints are community-verified. Table rebalancing edge cases are documented in TDD's own changelog. | + +**Overall confidence:** MEDIUM-HIGH + +### Gaps to Address + +- **Direct operator validation:** Feature priorities are inferred from competitive analysis, not operator interviews. The first beta deployments should include structured feedback collection to validate P1 feature completeness before Phase 2 work begins. +- **go-libsql stability and replication:** The go-libsql driver has no tagged releases and the LibSQL embedded replication feature is in public beta. The sync-to-Core path may not be needed if NATS JetStream handles all replication. Validate during Phase 1 whether LibSQL sync is used at all or NATS is the exclusive sync mechanism. +- **Netbird reverse proxy in production:** Beta status means API may change. Validate the full player PWA access flow (QR code → public HTTPS URL → WireGuard → Leaf) in a real venue network environment before Phase 3 depends on it. 
+- **Pi Zero 2W Chromium memory with multi-view display:** Memory profiling has been community-validated for basic kiosk use, but not for the specific animation patterns in a tournament clock display. Must be validated on actual hardware in Phase 2 before scaling display views. +- **Multi-currency display configuration:** Research flagged this as deferred (display-only currency symbol), but the data model choice (storing amounts as cents in a single implicit currency vs. currency-tagged amounts) must be made in Phase 1 even if multi-currency display is deferred. + +## Sources + +### Primary (HIGH confidence) +- Go 1.26 release — https://go.dev/blog/go1.26 +- NATS Server v2.12.4 releases — https://github.com/nats-io/nats-server/releases +- NATS JetStream Core Concepts — https://docs.nats.io/nats-concepts/jetstream +- Jepsen: NATS 2.12.1 analysis — https://jepsen.io/blog/2025-12-08-nats-2.12.1 +- NATS JetStream data loss GitHub issue — https://github.com/nats-io/nats-server/issues/7564 +- SQLite Write-Ahead Logging — https://sqlite.org/wal.html +- LibSQL Embedded Replicas data corruption — https://github.com/tursodatabase/libsql/discussions/1910 +- Multi-tenant Data Isolation with PostgreSQL RLS — AWS — https://aws.amazon.com/blogs/database/multi-tenant-data-isolation-with-postgresql-row-level-security/ +- Floats Don't Work for Storing Cents — Modern Treasury +- SvelteKit 2.53.x official docs — https://kit.svelte.dev +- Tailwind v4 Vite integration — https://tailwindcss.com/docs/guides/sveltekit +- The Tournament Director known bugs changelog — https://thetournamentdirector.net/changes.txt + +### Secondary (MEDIUM confidence) +- NATS Adaptive Edge Deployment — https://docs.nats.io/nats-concepts/service_infrastructure/adaptive_edge_deployment +- JetStream on Leaf Nodes — https://docs.nats.io/running-a-nats-service/configuration/leafnodes/jetstream_leafnodes +- NetBird Reverse Proxy Docs — https://docs.netbird.io/manage/reverse-proxy +- LibSQL Embedded Replicas — 
https://docs.turso.tech/features/embedded-replicas/introduction +- Authentik Netbird integration — https://docs.netbird.io/selfhosted/identity-providers/authentik +- CGO ARM64 cross-compilation community thread +- Go embed + SvelteKit pattern — https://www.liip.ch/en/blog/embed-sveltekit-into-a-go-binary +- Chromium on Pi Zero 2W memory — Raspberry Pi Forums +- PostgreSQL RLS implementation guide — permit.io / AWS +- Competitor feature analysis: Blind Valet, BravoPokerLive, LetsPoker, CasinoWare, kHold'em, PokerAtlas TableCaptain +- PokerNews: Best Poker Table Management Software comparison +- Poker TDA Rules 2013 (balancing procedures Rules 25-28) + +### Tertiary (LOW confidence) +- Synadia: AI at the Edge with NATS JetStream — single source for edge AI patterns +- Multi-Tenancy Database Patterns in Go — single source, corroborates general RLS pattern +- Raspberry Pi Kiosk System community project +- NetBird 2025 critical mistakes — third-party blog, verify against official docs +- GDPR compliance for gaming operators — legal advisory blog, not authoritative + +--- +*Research completed: 2026-02-28* +*Ready for roadmap: yes*