Architecture Research

Domain: Edge-cloud poker venue management platform (offline-first, three-tier)
Researched: 2026-02-28
Confidence: MEDIUM-HIGH (core patterns well-established; NATS leaf node specifics verified via official docs)

Standard Architecture

System Overview

┌─────────────────────────────────────────────────────────────────────────────────┐
│                            CLOUD TIER (Core)                                    │
│  Hetzner Dedicated — Proxmox VE — LXC Containers                                │
│                                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │  Go Core API │  │  PostgreSQL  │  │NATS JetStream│  │    Authentik     │   │
│  │  (multi-     │  │  (venue      │  │  (hub cluster│  │  (OIDC IdP)      │   │
│  │   tenant)    │  │   agg. data) │  │   R=3)       │  │                  │   │
│  └──────┬───────┘  └──────────────┘  └──────┬───────┘  └──────────────────┘   │
│         │                                    │                                  │
│  ┌──────────────┐  ┌──────────────┐          │                                  │
│  │  SvelteKit   │  │   Netbird    │          │  ← mirrors from leaf streams     │
│  │  (public     │  │  (WireGuard  │          │                                  │
│  │   pages,     │  │   mesh ctrl) │          │                                  │
│  │   admin UI)  │  │              │          │                                  │
│  └──────────────┘  └──────────────┘          │                                  │
└────────────────────────────────────────────────────────────────────────────────┘
         │ WireGuard encrypted tunnel (Netbird mesh)
         │ NATS leaf node connection (domain: "leaf-<venue-id>")
         │ NetBird reverse proxy (HTTPS → WireGuard → Leaf :8080)
         ↓
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         EDGE TIER (Leaf Node)                                   │
│  ARM64 SBC — Orange Pi 5 Plus — NVMe — ~€100                                   │
│                                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │  Go Leaf API │  │   LibSQL     │  │NATS JetStream│  │  SvelteKit       │   │
│  │  (tournament │  │  (embedded   │  │  (embedded   │  │  (operator UI    │   │
│  │   engine,    │  │   SQLite +   │  │   leaf node, │  │   served from    │   │
│  │   state mgr) │  │   WAL-based) │  │   local      │  │   Leaf)          │   │
│  └──────┬───────┘  └──────────────┘  │   streams)   │  └──────────────────┘   │
│         │                            └──────┬───────┘                          │
│         │ WebSocket broadcast               │ mirror stream                    │
│         │                                  ↓ (store-and-forward)              │
│  ┌──────────────┐                   to Core when online                        │
│  │  Hub Manager │                                                               │
│  │  (client     │                                                               │
│  │   registry,  │                                                               │
│  │   broadcast) │                                                               │
│  └──────────────┘                                                               │
└─────────────────────────────────────────────────────────────────────────────────┘
         │ Local WiFi / Ethernet
         │ WebSocket (ws:// — LAN only, no TLS needed)
         │ Chromium kiosk HTTP polling / WebSocket
         ↓
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         DISPLAY TIER (Display Nodes)                            │
│  Raspberry Pi Zero 2W — 512MB RAM — ~€20 each                                  │
│                                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │  Chromium    │  │  Chromium    │  │  Chromium    │  │    Chromium      │   │
│  │  Kiosk       │  │  Kiosk       │  │  Kiosk       │  │    Kiosk         │   │
│  │  (Clock      │  │  (Rankings)  │  │  (Seating)   │  │  (Signage)       │   │
│  │   view)      │  │              │  │              │  │                  │   │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────────┘   │
│  Raspberry Pi OS Lite + X.Org + Openbox + Chromium (kiosk mode, no UI chrome)  │
└─────────────────────────────────────────────────────────────────────────────────┘

Player Phones (PWA) ←→ Netbird reverse proxy ←→ Leaf Node
                        (public HTTPS URL, same URL from any network)

Component Responsibilities

| Component | Responsibility | Implementation |
| --- | --- | --- |
| Go Leaf API | Tournament engine, financial engine, state machine, WebSocket hub, REST/WS API for operator + players | Go binary (ARM64), goroutine-per-subsystem, embedded NATS + LibSQL |
| LibSQL (Leaf) | Single source of truth for venue state, tournament data, player records | Embedded SQLite via LibSQL driver (github.com/tursodatabase/go-libsql), WAL mode |
| NATS JetStream (Leaf) | Local pub/sub for in-process events, durable stream for cloud sync, event audit log | Embedded nats-server process, domain `leaf-<venue-id>`, local stream mirrored to Core |
| SvelteKit (from Leaf) | Operator UI (admin SPA), player PWA, display views served by Leaf | Static SvelteKit build served by Go's net/http or embedded filesystem |
| Hub Manager (Leaf) | WebSocket connection registry, broadcast state to all connected clients | Go goroutines + channels; one goroutine per connection, central broadcast channel |
| Netbird Agent (Leaf) | WireGuard tunnel to Core, reverse proxy target registration, DNS | Netbird client process, auto-reconnects, handles NAT traversal via STUN/TURN |
| Go Core API | Multi-tenant aggregation, cross-venue leagues, player identity, remote API access, cloud-hosted free tier | Go binary (amd64), PostgreSQL with RLS, NATS hub cluster |
| PostgreSQL (Core) | Persistent store for aggregated venue data, player profiles, leagues, analytics | PostgreSQL 17+, RLS by venue_id, pgx driver in Go |
| NATS JetStream (Core) | Hub cluster receiving mirrored streams from all leaves, fan-out to analytics consumers | Clustered NATS (R=3), domain `core`, stream sources from all leaf mirrors |
| Authentik (Core) | OIDC identity provider for Netbird and Felt operator auth, PIN login fallback | Self-hosted Authentik, ~200MB RAM, Apache 2.0 |
| Netbird Control (Core) | Mesh network management plane, policy distribution, reverse proxy routing | Self-hosted Netbird management + signal services |
| SvelteKit (Core) | Public venue pages (SSR), admin dashboard, free-tier virtual Leaf UI | SvelteKit with SSR for public pages, SPA for dashboard |
| Display Nodes | Render assigned view (clock/rankings/seating/signage) in kiosk browser | Pi Zero 2W + Raspberry Pi OS Lite + X.Org + Openbox + Chromium kiosk |
Recommended Project Structure

felt/
├── cmd/
│   ├── leaf/               # Leaf Node binary entrypoint (ARM64 target)
│   │   └── main.go         # Boots LibSQL, embedded NATS, HTTP/WS server
│   └── core/               # Core binary entrypoint (amd64 target)
│       └── main.go         # Boots PostgreSQL conn, NATS hub, HTTP server
│
├── internal/
│   ├── tournament/         # Domain: tournament engine (state machine)
│   │   ├── engine.go       # Clock, blinds, levels — pure business logic
│   │   ├── financial.go    # Buy-ins, rebuys, prize pool, rake
│   │   ├── seating.go      # Table layout, auto-balance, drag-and-drop
│   │   └── events.go       # Domain events emitted on state changes
│   ├── player/             # Domain: player management
│   │   ├── registry.go     # Player database, registration, bust-out
│   │   └── identity.go     # Platform-level identity (belongs to Felt, not venue)
│   ├── display/            # Domain: display node management
│   │   ├── registry.go     # Node registration, view assignment
│   │   └── views.go        # View types: clock, rankings, seating, signage
│   ├── sync/               # NATS JetStream sync layer
│   │   ├── leaf.go         # Leaf-side: publish events, mirror config
│   │   └── core.go         # Core-side: consume from leaf mirrors, aggregate
│   ├── ws/                 # WebSocket hub
│   │   ├── hub.go          # Client registry, broadcast channel
│   │   └── handler.go      # Upgrade, read pump, write pump
│   ├── api/                # HTTP handlers (shared where possible)
│   │   ├── tournament.go
│   │   ├── player.go
│   │   └── display.go
│   ├── store/              # Data layer
│   │   ├── libsql/         # Leaf: LibSQL queries (sqlc generated)
│   │   └── postgres/       # Core: PostgreSQL queries (sqlc generated)
│   └── auth/               # Auth: PIN offline, OIDC online
│       ├── pin.go
│       └── oidc.go
│
├── frontend/               # SvelteKit applications
│   ├── operator/           # Operator UI (served from Leaf)
│   ├── player/             # Player PWA (served from Leaf)
│   ├── display/            # Display views (served from Leaf)
│   └── public/             # Public venue pages (served from Core, SSR)
│
├── schema/
│   ├── libsql/             # LibSQL migrations (goose or atlas)
│   └── postgres/           # PostgreSQL migrations (goose or atlas)
│
├── build/
│   ├── leaf/               # Dockerfile.leaf, systemd units, LUKS scripts
│   └── core/               # Dockerfile.core, LXC configs, Proxmox notes
│
└── scripts/
    ├── cross-build.sh      # GOOS=linux GOARCH=arm64 go build ./cmd/leaf
    └── provision-leaf.sh   # Flash + configure a new Leaf device

Structure Rationale

  • cmd/leaf vs cmd/core: Same internal packages, different main.go wiring. Shared domain logic compiles to both targets without duplication. GOARCH=arm64 for leaf, default for core.
  • internal/tournament/: Pure domain logic with no I/O dependencies. Testable without database or NATS.
  • internal/sync/: The bridge between domain events and NATS JetStream. Leaf publishes; Core subscribes via mirror.
  • internal/ws/: Hub pattern isolates WebSocket concerns. Goroutines for each connection; central broadcast channel prevents blocking.
  • schema/libsql vs schema/postgres: Separate migration paths because LibSQL (SQLite dialect) and PostgreSQL have syntax differences (no arrays, different types).

Architectural Patterns

Pattern 1: NATS JetStream Leaf-to-Core Domain Sync

What: Leaf node runs an embedded NATS server with its own JetStream domain (leaf-<venue-id>). All state-change events are published to a local stream. Core creates a mirror of this stream using stream source configuration. JetStream's store-and-forward guarantees delivery when the connection resumes after offline periods.

When to use: For any state that needs to survive offline periods and eventually reach Core. All tournament events, financial transactions, player registrations.

Trade-offs: At-least-once delivery means consumers must be idempotent. Message IDs on publish plus deduplication windows on Core resolve this. No ordering guarantees across subjects, but per-subject ordering is preserved.

Domain configuration (NATS server config on Leaf):

# leaf-node.conf
jetstream {
  domain: "leaf-venue-abc123"
  store_dir: "/data/nats"
}

leafnodes {
  remotes [
    {
      urls: ["nats://core.felt.internal:7422"]
      account: "$G"
    }
  ]
}

Mirror configuration (Core side — creates source from leaf domain):

// core/sync.go
// The context-taking jetstream package uses CreateStream (the older
// nats package JetStreamContext has AddStream without a context).
_, err := js.CreateStream(ctx, jetstream.StreamConfig{
    Name: "VENUE_ABC123_EVENTS",
    Sources: []*jetstream.StreamSource{
        {
            Name:          "VENUE_EVENTS",
            Domain:        "leaf-venue-abc123",
            FilterSubject: "venue.abc123.>",
        },
    },
})
if err != nil {
    // stream already exists with a different config, or Core is unreachable
    return err
}
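Because delivery from the mirror is at-least-once, the Core consumer must be idempotent. JetStream can deduplicate server-side (a `Nats-Msg-Id` header on publish plus a `Duplicates` window on the stream); the sketch below shows the same contract application-side with an in-memory seen-set. The `Deduper` type and the message-ID format are illustrative, not part of the codebase:

```go
package main

import "fmt"

// Deduper drops redeliveries by message ID. JetStream can enforce this
// server-side via the Nats-Msg-Id header and a deduplication window;
// this application-side variant sketches the same contract.
type Deduper struct {
	seen map[string]bool
}

// Process applies the side effect only the first time an ID is seen.
func (d *Deduper) Process(msgID string, apply func()) bool {
	if d.seen[msgID] {
		return false // duplicate redelivery; skip the side effect
	}
	d.seen[msgID] = true
	apply()
	return true
}

func main() {
	d := &Deduper{seen: make(map[string]bool)}
	writes := 0
	apply := func() { writes++ }

	d.Process("venue.abc123:seq-42", apply)
	d.Process("venue.abc123:seq-42", apply) // redelivered after reconnect
	fmt.Println(writes) // 1
}
```

In production the seen-set would be bounded (a time window), matching JetStream's own deduplication-window semantics.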

Pattern 2: WebSocket Hub-and-Broadcast for Real-Time Clients

What: Central Hub struct in Go holds a map of active connections (operator UI, player PWA, display nodes). State changes trigger a broadcast to the Hub. The Hub writes to each connection's send channel. Per-connection goroutines handle read and write independently, preventing slow clients from blocking others.

When to use: Any real-time update that needs to reach all connected clients within 100ms — clock ticks, table state changes, seating updates.

Trade-offs: In-process hub is simple and fast. No Redis pub/sub needed at single-venue scale. Restart drops all connections (clients must reconnect — which is standard WebSocket behavior).

Example (Hub pattern):

type Hub struct {
    clients    map[*Client]bool // active connections: operator UI, player PWA, displays
    broadcast  chan []byte      // serialized state deltas fan out from here
    register   chan *Client
    unregister chan *Client
}

func (h *Hub) Run() {
    for {
        select {
        case client := <-h.register:
            h.clients[client] = true
        case client := <-h.unregister:
            if _, ok := h.clients[client]; ok { // guard: may already be dropped below
                delete(h.clients, client)
                close(client.send)
            }
        case message := <-h.broadcast:
            for client := range h.clients {
                select {
                case client.send <- message: // buffered channel; non-blocking
                default: // slow client: drop it rather than stall the broadcast
                    close(client.send)
                    delete(h.clients, client)
                }
            }
        }
    }
}
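The Hub assumes a Client with a buffered send channel. The sketch below isolates that send semantics so it can run standalone; `trySend` is an illustrative helper, and the WebSocket connection and read/write pumps are deliberately elided:

```go
package main

import "fmt"

// Client pairs a WebSocket connection with a buffered outbound channel;
// the connection and pumps are elided so the send semantics run alone.
// A full buffer marks the client as slow.
type Client struct {
	send chan []byte
}

// trySend mirrors the hub's broadcast branch: write if the buffer has
// room, otherwise report the client as slow so the hub can drop it.
func trySend(c *Client, msg []byte) bool {
	select {
	case c.send <- msg:
		return true
	default:
		return false
	}
}

func main() {
	c := &Client{send: make(chan []byte, 2)} // buffer absorbs short bursts
	fmt.Println(trySend(c, []byte("tick-1"))) // true
	fmt.Println(trySend(c, []byte("tick-2"))) // true
	fmt.Println(trySend(c, []byte("tick-3"))) // false: buffer full, hub would drop this client
}
```

The buffer size trades memory per client against tolerance for brief stalls; for clock ticks at one message per second, a small buffer is plenty.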

Pattern 3: Offline-First with Local-Writes-First

What: All writes go to LibSQL (Leaf) first, immediately confirming to the client. LibSQL write triggers a domain event published to local NATS stream. NATS mirrors the event to Core asynchronously when online. The UI subscribes to WebSocket and sees state changes from the local store — never waiting on the network.

When to use: All operational writes: starting clock, registering buy-in, busting a player, assigning a table.

Trade-offs: Core is eventually consistent with Leaf, not strongly consistent. For operational use (venue running a tournament), this is the correct trade-off — the venue never waits for the cloud. Cross-venue features (league standings) accept slight delay.
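The local-writes-first sequence can be sketched with the store and stream behind interfaces. `Store`, `Stream`, and `AdvanceLevel` are illustrative names, not the actual API; the point is the ordering: confirm after the local write, never after the network:

```go
package main

import "fmt"

// Store and Stream stand in for LibSQL and the local JetStream client.
// The key property: the client is confirmed as soon as the local write
// lands; replication to Core happens later, asynchronously.
type Store interface{ Apply(event string) error }
type Stream interface{ Publish(subject, event string) error }

type memStore struct{ events []string }

func (m *memStore) Apply(e string) error { m.events = append(m.events, e); return nil }

type memStream struct{ published []string }

func (m *memStream) Publish(subj, e string) error {
	m.published = append(m.published, subj+" "+e)
	return nil
}

// AdvanceLevel applies the write locally, then publishes the domain event
// to the local stream (mirrored to Core whenever a link exists) and hands
// a delta to the hub. Nothing here waits on the internet.
func AdvanceLevel(store Store, stream Stream, broadcast chan<- string, venueID string) error {
	event := `{"type":"LEVEL_ADVANCED","level":5}`
	if err := store.Apply(event); err != nil {
		return err // local write failed: reject, publish nothing
	}
	_ = stream.Publish("venue."+venueID+".tournament.events", event) // async mirror to Core
	broadcast <- event                                               // real-time fan-out
	return nil
}

func main() {
	store, stream := &memStore{}, &memStream{}
	bc := make(chan string, 1)
	if err := AdvanceLevel(store, stream, bc, "abc123"); err != nil {
		panic(err)
	}
	fmt.Println(len(store.events), len(stream.published), <-bc)
}
```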

Pattern 4: Event-Sourced Audit Trail via JetStream Streams

What: NATS JetStream streams are append-only and immutable by default. Every state change (clock pause, player bust-out, financial transaction) is published as an event with sequence number and timestamp. The stream retains full history. This doubles as the sync mechanism and the audit log. Current state in LibSQL is the projection of these events.

When to use: All state changes that need an audit trail (financial transactions, player registrations, table assignments).

Trade-offs: Stream storage grows over time (limit by time or byte size for old tournaments). Projecting current state from events adds complexity on recovery — mitigate with snapshots in LibSQL. Full event history is available in Core for analytics.
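A toy projection illustrating why a snapshot (materialized state plus the last applied stream sequence) bounds recovery time and makes replay idempotent. `Event`, `Snapshot`, and `Project` are illustrative names, not the engine's types:

```go
package main

import "fmt"

// Event is one immutable entry from the JetStream audit stream.
type Event struct {
	Seq  uint64
	Type string // e.g. "BUY_IN", "BUST_OUT"
}

// Snapshot pins a materialized state to the last stream sequence it
// reflects, so recovery replays only events with Seq > snapshot.Seq.
type Snapshot struct {
	Seq           uint64
	ActivePlayers int
}

// Project folds events after the snapshot into current state.
func Project(s Snapshot, events []Event) Snapshot {
	for _, e := range events {
		if e.Seq <= s.Seq {
			continue // already reflected in the snapshot; replay-safe
		}
		switch e.Type {
		case "BUY_IN":
			s.ActivePlayers++
		case "BUST_OUT":
			s.ActivePlayers--
		}
		s.Seq = e.Seq
	}
	return s
}

func main() {
	snap := Snapshot{Seq: 2, ActivePlayers: 2} // persisted in LibSQL
	tail := []Event{
		{Seq: 2, Type: "BUY_IN"}, // duplicate redelivery: skipped
		{Seq: 3, Type: "BUY_IN"},
		{Seq: 4, Type: "BUST_OUT"},
	}
	fmt.Println(Project(snap, tail)) // {4 2}
}
```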

Pattern 5: Display Node View Assignment via URL Parameters

What: Each Pi Zero 2W Chromium instance opens a Leaf-served URL such as http://leaf.local:8080/display?view=clock&tournament=abc123 (ad-hoc) or http://leaf.local:8080/display?node-id=display-001 (registered nodes, where the Leaf resolves the assigned view from its node registry). The Leaf serves the display SvelteKit app. The view assignment is set via the operator UI's node registry. Chromium kiosk mode (no UI chrome) renders full-screen. Changes to view assignment push through WebSocket, triggering client-side navigation.

When to use: All display node management — view assignment, content scheduling, emergency override.

Trade-offs: URL-based assignment is simple and stateless on the display node. Requires reliable local WiFi. Pi Zero 2W's 512MB RAM constrains complex Svelte animations; keep display views lightweight (clock, text, simple tables).

Data Flow

Tournament State Change Flow (Operator Action)

Operator touches UI (e.g., "Advance Level")
    ↓
SvelteKit → POST /api/tournament/{id}/level/advance
    ↓
Go Leaf API handler validates & applies change
    ↓
LibSQL write (authoritative local state update)
    ↓
Domain event emitted: {type: "LEVEL_ADVANCED", level: 5, ...}
    ↓
Event published to NATS subject: "venue.{id}.tournament.{id}.events"
    ↓
NATS local stream appends event (immutable audit log)
    ↓ (parallel)
Hub.broadcast ← serialized state delta (JSON)
    ↓
All WebSocket clients receive update within ~10ms
    ├── Operator UI: updates clock display
    ├── Player PWA: updates blind levels shown
    └── Display Nodes: all views react to new state
    ↓ (async, when online)
NATS mirror replicates event to Core stream
    ↓
Core consumer processes event → writes to PostgreSQL
    ↓
Aggregated data available for cross-venue analytics

Player Phone Access Flow (Online)

Player scans QR code → browser opens https://venue.felt.app/play
    ↓
DNS resolves to Core (public IP)
    ↓
NetBird reverse proxy (TLS termination at proxy)
    ↓
Encrypted WireGuard tunnel → Leaf Node :8080
    ↓
Go Leaf API serves SvelteKit PWA
    ↓
PWA opens WebSocket wss://venue.felt.app/ws (TLS terminated at the proxy, proxied via the same mechanism — a plain ws:// URL would be blocked as mixed content on an HTTPS origin)
    ↓
Player sees live clock, blinds, rankings, personal stats

Display Node Lifecycle

Pi Zero 2W boots → systemd starts X.Org → Openbox autostart → Chromium kiosk
    ↓
Chromium opens: http://leaf.local:8080/display?node-id=display-001
    ↓
Leaf API: lookup node-id in node registry → determine assigned view
    ↓
SvelteKit display app renders assigned view (clock / rankings / seating / signage)
    ↓
WebSocket connection held to Leaf
    ↓
When operator reassigns view → Hub broadcasts view-change event
    ↓
Display SvelteKit navigates to new view (client-side routing, no page reload)

Offline → Online Reconnect Sync

Leaf node was offline (NATS leaf connection dropped)
    ↓
Venue continues operating normally (LibSQL is authoritative, NATS local streams work)
All events accumulate in local JetStream stream (store-and-forward)
    ↓
WireGuard tunnel restored (Netbird handles auto-reconnect)
    ↓
NATS leaf node reconnects to Core hub
    ↓
JetStream mirror resumes replication from last sequence number
    ↓
Core processes accumulated events in order (per-subject ordering preserved)
    ↓
PostgreSQL updated with all events that occurred during offline period

Multi-Tenant Core Data Model

Core PostgreSQL:
  venues (id, name, netbird_peer_id, subscription_tier, ...)
  tournaments (id, venue_id, ...)       ← RLS: venue_id = current_setting('app.venue_id')
  players (id, felt_user_id, ...)       ← platform-level identity (no venue_id)
  league_standings (id, league_id, ...) ← cross-venue aggregation

Scaling Considerations

| Scale | Architecture Adjustments |
| --- | --- |
| 1-50 venues (MVP) | Single Core server on Hetzner; NATS single-node or simple cluster; LibSQL on each Leaf is the bottleneck-free read path |
| 50-500 venues | NATS core cluster R=3 is already the design; PostgreSQL read replicas for analytics; SvelteKit public site to CDN |
| 500+ venues | NATS super-cluster across Hetzner regions; PostgreSQL sharding by venue_id; dedicated analytics database (TimescaleDB or ClickHouse for the event stream) |

Scaling Priorities

  1. First bottleneck: Core NATS hub receiving mirrors from many leaves simultaneously. Mitigation: NATS is designed for high fan-in (published benchmarks reach millions of messages per second per server); this is unlikely to be the bottleneck before 500 venues.
  2. Second bottleneck: PostgreSQL write throughput as event volume grows. Mitigation: NATS stream is the durable store; Postgres writes are async. TimescaleDB for time-series event analytics defers this further.
  3. Not a bottleneck: Leaf Node WebSocket clients — 25,000+ connections on a modest server (the Leaf handles 1 venue, typically 5-50 concurrent clients).

Anti-Patterns

Anti-Pattern 1: Making Core a Write Path Dependency

What people do: Design operator actions to write to Core (cloud) first, then sync down to Leaf.
Why it's wrong: The primary constraint is offline-first. If Core is the write path, any internet disruption breaks the entire operation.
Do this instead: Leaf is always the authoritative write target. Core is a read/analytics/aggregation target. Never make Core an operational dependency.

Anti-Pattern 2: Shared Database Between Leaf and Core

What people do: Try to use a single LibSQL instance with remote replication as both the Leaf store and Core store.
Why it's wrong: LibSQL embedded replication (Turso model) requires connectivity to the remote primary for writes, which violates offline-first. Core also needs PostgreSQL features (RLS, complex queries, multi-venue joins) that LibSQL cannot provide.
Do this instead: Separate data stores per tier. LibSQL on Leaf (sovereign, offline-capable). PostgreSQL on Core (multi-tenant, cloud-native). NATS JetStream is the replication channel, not the database driver.

Anti-Pattern 3: Single Goroutine WebSocket Broadcast

What people do: Iterate over all connected clients in a single goroutine and write synchronously.
Why it's wrong: A slow or disconnected client blocks the broadcast for all others. One stale connection delays the clock update for everyone.
Do this instead: Hub pattern with per-client send channels (buffered). Use select with a default case to drop slow clients rather than block. Per-connection goroutines handle writes to the actual WebSocket.

Anti-Pattern 4: Storing View Assignment State on Display Nodes

What people do: Configure display views locally on each Pi and SSH in to change them.
Why it's wrong: It requires SSH access to each device, offers no central management, and makes adding a new display a physical configuration task. It also breaks offline-first if central config is required at boot.
Do this instead: Display nodes are stateless. They register with the Leaf by device ID (MAC or serial). The Leaf holds the view assignment; display nodes poll/subscribe for theirs. A physical Pi can be swapped without reconfiguration.

Anti-Pattern 5: Separate Go Codebases for Leaf and Core

What people do: Create two independent Go repositories with duplicated domain logic.
Why it's wrong: Business logic diverges over time. Bugs fixed in one aren't fixed in the other. Double maintenance burden for a solo developer.
Do this instead: Single Go monorepo with shared internal/ packages. cmd/leaf/main.go and cmd/core/main.go are the only divergence points — they wire up the same packages with different configuration. GOOS=linux GOARCH=arm64 go build ./cmd/leaf for the Leaf binary.

Integration Points

External Services

| Service | Integration Pattern | Notes |
| --- | --- | --- |
| Netbird (WireGuard mesh) | Agent on Leaf connects to self-hosted Netbird management service; reverse proxy configured per-venue | NetBird reverse proxy is beta and requires Traefik as the external reverse proxy on Core; test early |
| Authentik (OIDC) | Leaf uses OIDC tokens from Authentik for operator login when online; PIN login as offline fallback | PIN verification against a locally cached hash in LibSQL; no Authentik dependency during offline operation |
| NATS JetStream (leaf↔core) | Leaf runs an embedded NATS server as a leaf node connecting to the Core hub over WireGuard | Domain isolation per venue; subjects namespaced `venue.<id>.>` |
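The offline PIN fallback can be sketched against a locally cached hash. This uses only the standard library; a real deployment should substitute a slow KDF such as argon2id or bcrypt for plain SHA-256, and the salt handling here is illustrative:

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
)

// hashPIN derives the cached credential written to LibSQL at enrollment
// time, while Authentik is reachable. SHA-256 keeps this sketch stdlib-
// only; production should use argon2id or bcrypt instead.
func hashPIN(pin, salt string) [32]byte {
	return sha256.Sum256([]byte(salt + ":" + pin))
}

// verifyPIN runs entirely against the local cache, with no Authentik
// round trip, so operator login keeps working through an outage.
// The comparison is constant-time to avoid leaking prefix matches.
func verifyPIN(pin, salt string, cached [32]byte) bool {
	h := hashPIN(pin, salt)
	return subtle.ConstantTimeCompare(h[:], cached[:]) == 1
}

func main() {
	cached := hashPIN("4821", "venue-abc123") // stored during online enrollment
	fmt.Println(verifyPIN("4821", "venue-abc123", cached)) // true
	fmt.Println(verifyPIN("0000", "venue-abc123", cached)) // false
}
```

Rate limiting (e.g. lockout after N failures, also tracked locally) matters more than hash strength for a 4-digit PIN.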

Internal Boundaries

| Boundary | Communication | Notes |
| --- | --- | --- |
| Go Leaf API ↔ LibSQL | Direct SQL via go-libsql driver (CGo-free driver preferred for cross-compilation) | Use sqlc for type-safe query generation; avoid raw string queries |
| Go Leaf API ↔ NATS (local) | In-process NATS client connecting to the embedded server (`nats.Connect("nats://127.0.0.1:4222")`) | Publish on every state-change event; Hub subscribes to NATS for broadcast triggers |
| Go Leaf API ↔ WebSocket Hub | Channel-based: API handlers send to the `hub.broadcast` channel | Hub runs in its own goroutine; never call Hub methods directly from handlers |
| Go Core API ↔ PostgreSQL | pgx/v5 driver, sqlc-generated queries; RLS via `SELECT set_config('app.venue_id', $1, true)` in the transaction (`SET LOCAL` cannot take bind parameters) | Row-level security enforced at the database layer as defense-in-depth |
| Go Core API ↔ NATS (hub) | Standard NATS client; one consumer per venue mirror stream | Push consumers for real-time processing; durable consumers for reliable at-least-once delivery |
| Leaf ↔ Display Nodes | HTTP (serve SvelteKit app) + WebSocket (state updates) over local LAN | No TLS on the local LAN; Leaf and displays share a trusted network |
| Leaf ↔ Player PWA | HTTP + WebSocket proxied via Netbird reverse proxy | HTTPS terminates at the proxy, then traffic travels over WireGuard to the Leaf |

Suggested Build Order

The build order derives from dependency relationships: each layer must be tested before the layer above it depends on it.

Phase 1: Foundation (Leaf Core + Networking)
   1a. LibSQL schema + Go data layer (sqlc queries, migrations)
   1b. Tournament engine (pure Go, no I/O — state machine logic)
   1c. NATS embedded + local event publishing
   1d. WebSocket Hub (broadcast infrastructure)
   1e. REST + WS API (operator endpoints)
   1f. Netbird agent on Leaf (WireGuard mesh)
   1g. PIN auth (offline) + OIDC auth (online fallback)
        ↓ Validates: Offline operation works end-to-end

Phase 2: Frontend Clients
   2a. SvelteKit operator UI (connects to Leaf API + WS)
   2b. SvelteKit display views (connects to Leaf WS)
   2c. Player PWA (connects to Leaf via Netbird reverse proxy)
        ↓ Validates: Real-time sync, display management, player access

Phase 3: Cloud Sync (Core)
   3a. PostgreSQL schema + RLS (multi-tenant)
   3b. NATS hub cluster on Core
   3c. Leaf-to-Core stream mirroring (event replay on reconnect)
   3d. Go Core API (multi-tenant REST, league aggregation)
   3e. SvelteKit public pages (SSR) + admin dashboard
        ↓ Validates: Offline sync, cross-venue features, eventual consistency

Phase 4: Display Management + Signage
   4a. Display node registry (Leaf API)
   4b. View assignment system (operator sets view per node)
   4c. Pi Zero 2W provisioning scripts (kiosk setup automation)
   4d. Digital signage content system + scheduler
        ↓ Validates: Wireless display management at scale

Phase 5: Authentication + Security Hardening
   5a. Authentik OIDC integration
   5b. LUKS encryption on Leaf (device-level)
   5c. NATS auth callout (per-venue account isolation)
   5d. Audit trail validation (event stream integrity checks)

Why this order:

  • Leaf foundation must exist before any frontend can connect to it
  • Tournament engine logic is the most complex domain; test it isolated before adding network layers
  • Cloud sync (Phase 3) is a progressive enhancement — the Leaf works completely without it
  • Display management (Phase 4) depends on the WebSocket infrastructure from Phase 1
  • Auth hardening (Phase 5) is last because it can wrap existing endpoints without architectural change

Sources


Architecture research for: Felt — Edge-cloud poker venue management platform
Researched: 2026-02-28