# Pitfalls Research

**Domain:** Edge-cloud poker venue management platform (offline-first, ARM64 SBC, real-time display sync, multi-tenant SaaS)

**Researched:** 2026-02-28

**Confidence:** MEDIUM-HIGH (critical pitfalls verified against multiple sources; some domain-specific items from training data flagged)

---

## Critical Pitfalls

### Pitfall 1: NATS JetStream Data Loss Under Default fsync Settings

**What goes wrong:**

NATS JetStream's default `sync_interval` is 2 minutes — meaning acknowledged messages are not guaranteed to be on disk before the ACK is sent to the client. A kernel crash, power loss, or sudden SBC shutdown on the Leaf node can result in losing messages that were already ACK'd by the broker. The December 2025 Jepsen analysis of NATS 2.12.1 confirmed this: "NATS JetStream can lose data or get stuck in persistent split-brain in response to file corruption or simulated node failures." Even a single power failure can trigger loss of committed writes.

**Why it happens:**

The lazy fsync default prioritizes throughput over durability. This is the correct tradeoff for most messaging workloads. However, tournament state sync is financial-adjacent: a lost "player busted" or "rebuy processed" event corrupts the prize pool ledger and produces incorrect final payouts. Developers assume "acknowledged" means "durable."

**How to avoid:**

Set `sync_interval: always` in the embedded NATS server configuration on the Leaf node. This makes NATS fsync before every acknowledgement. Accept the throughput reduction — tournament events are low-frequency (< 100 events/minute), so this has zero practical impact. Verify this setting is in the embedded server config before the first production deploy.

```yaml
# nats-server.conf for Leaf
jetstream {
  store_dir: /var/lib/felt/jetstream
  sync_interval: always
}
```

**Warning signs:**

- Default config template copied from NATS quickstart docs (which do not set `sync_interval`)
- Tournament events not replaying correctly after SBC reboot in testing
- "Event count mismatch" after power-cycle tests

**Phase to address:** Phase 1 (NATS JetStream integration) — bake this into the embedded server bootstrap config, not as a post-launch fix.

---

### Pitfall 2: LibSQL/SQLite WAL Checkpoint Stall Causing Write Timeouts

**What goes wrong:**

SQLite in WAL mode accumulates changes in the WAL file until a checkpoint copies them back to the main database. Under sustained write load (an active tournament with frequent state updates), the WAL file can grow unbounded if readers hold long transactions — preventing the checkpoint from completing. When a checkpoint finally runs (FULL or RESTART mode), it briefly blocks all writers, causing the tournament clock update goroutine to queue up and miss a 1-second tick. Separately, running LibSQL's sync operation while the local WAL is being actively written risks data corruption (documented in LibSQL issue #1910).

**Why it happens:**

Developers set WAL mode and consider concurrency "solved." They miss that: (1) WAL autocheckpoint defaults to PASSIVE mode, which skips work when readers are present; (2) uncontrolled WAL growth degrades read performance, as SQLite must scan further into WAL history; and (3) LibSQL sync must not overlap with active write transactions.

**How to avoid:**

- Set `PRAGMA wal_autocheckpoint = 0` to disable automatic checkpointing, then schedule explicit `PRAGMA wal_checkpoint(TRUNCATE)` during quiet periods (e.g., tournament breaks, level transitions).
- Set `PRAGMA journal_size_limit = 67108864` (64MB) to cap WAL file size.
- Never initiate LibSQL cloud sync during an active database write transaction — gate sync on a mutex shared with the write path.
- Use `PRAGMA busy_timeout = 5000` to avoid immediate failures when contention occurs.

**Warning signs:**

- Tournament clock drifting by 1-2 seconds during heavy rebuy periods
- WAL file growing past 50MB during long tournaments
- LibSQL sync errors logged during high-activity periods

**Phase to address:** Phase 1 (database layer setup) — configure these pragmas in the database initialization code path, not as tuning afterthoughts.

---

### Pitfall 3: Offline-First Sync Conflict Producing Incorrect Prize Pool Ledger

**What goes wrong:**

The Leaf node operates offline. If an operator on the venue tablet and a player viewing their phone PWA both initiate actions that affect the same tournament state (e.g., the operator marks a player as busted while the player registers a rebuy via the PWA), conflicting events arrive at the NATS stream in undefined order when connectivity resumes. A last-write-wins or timestamp-based merge on financial records corrupts the prize pool: a rebuy that was processed offline gets silently dropped, and the player is paid out less than they are owed.

**Why it happens:**

Developers treat all offline sync as equal. Financial ledger mutations (buy-ins, rebuys, payouts) are not idempotent by default. The LibSQL sync "last-push-wins" default is dangerous for financial records, where all writes must be applied in the correct order. Timestamp-based ordering fails on SBCs, where system clocks can drift.

**How to avoid:**

- Treat financial transactions as an append-only event log, never as mutable rows. Each buy-in, rebuy, add-on, and payout is an immutable event with a monotonic sequence number assigned by the Leaf node.
- Never use wall-clock timestamps to order conflicting financial events — use Lamport clocks or NATS sequence numbers as the canonical ordering.
- Derive the prize pool balance from the event log (computed, never stored directly), so a replay of all events always produces the correct total.
- Mark financial events as requiring explicit human conflict resolution (surface a UI alert) rather than auto-merging in the rare case of a genuine conflict.

**Warning signs:**

- Prize pool total doesn't match sum of individual player buy-in records
- Rebuy count in player history differs from rebuy count in prize pool calculation
- Sync error logs showing sequence number gaps in NATS stream replay

**Phase to address:** Phase 1 (financial engine design) — must be an architectural decision, not retrofittable.

---

### Pitfall 4: Pi Zero 2W Memory Exhaustion Crashing Display Node

**What goes wrong:**

The Raspberry Pi Zero 2W has 512MB RAM. Chromium running a full WebGL/Canvas display (animated tournament clock, live chip counts, scrolling rankings) can consume 300-400MB alone, leaving minimal headroom. Memory pressure causes the kernel OOM killer to terminate Chromium mid-display. Without a proper watchdog and restart mechanism, the display node goes dark silently — venue staff don't notice until a player complains. Over multi-hour tournaments, memory leaks in JavaScript (uncollected WebSocket message handlers, accumulated DOM nodes) compound this.

**Why it happens:**

Development happens on a desktop or a full Raspberry Pi 4 with 4-8GB RAM. The Zero 2W constraint is not felt until hardware testing. Display views are designed with visual richness in mind, without profiling memory consumption on the target hardware. WebSocket reconnection handlers that fail to deregister previous listeners create unbounded listener growth.

**How to avoid:**

- Test ALL display views on actual Pi Zero 2W hardware from day one — not just functionality but memory usage (use `chrome://memory-internals` or external monitoring).
- Set Chromium flags: `--js-flags="--max-old-space-size=256"` (the V8 heap cap must be passed via `--js-flags`) and `--memory-pressure-off` — counterintuitively, disabling pressure signals can prevent thrashing.
- Enable `zram` on the Pi Zero 2W (compressed swap) — adds ~200MB effective memory, documented to make Chromium "usable" on constrained devices.
- Implement a kiosk watchdog service (systemd `Restart=always` + `MemoryMax=450M`) that restarts Chromium if it exceeds memory limits.
- Use Server-Sent Events (SSE) instead of WebSocket for display-only views — reduces connection overhead and eliminates bidirectional state machine complexity where one-way push is sufficient.
- Implement manual listener cleanup in all WebSocket event handlers: always call `removeEventListener` in cleanup functions.
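
The watchdog bullet could look roughly like the following systemd unit; the unit name, script path, and target are placeholders for the real kiosk setup:

```ini
# /etc/systemd/system/felt-kiosk.service -- illustrative names
[Unit]
Description=Felt display kiosk (Chromium)
After=graphical.target

[Service]
ExecStart=/usr/local/bin/kiosk.sh
Restart=always
RestartSec=3
# systemd OOM-kills the cgroup if Chromium exceeds the cap,
# then Restart=always brings the display back up
MemoryMax=450M

[Install]
WantedBy=graphical.target
```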

**Warning signs:**

- Display works fine for 30 minutes, then goes blank
- `dmesg` on Pi Zero shows `oom-kill` entries
- Memory usage climbs monotonically over tournament duration when profiling

**Phase to address:** Phase 1 (display node MVP) — must validate on target hardware before building more display views.

---

### Pitfall 5: Tournament Table Rebalancing Algorithm Producing Unfair or Invalid Seating

**What goes wrong:**

Table rebalancing when players bust out is operationally critical and algorithmically subtle. Common failures: (1) moving a player who just posted their blind, creating a situation where they post the blind twice before being able to act; (2) breaking a table that still has enough players to stay open; (3) choosing a player to move who has the dealer button (invalid in most rule sets); (4) the algorithm enters an infinite loop when there is no valid move (e.g., exactly balanced tables that can't be further balanced without breaking one). The Tournament Director software has a known bug where "if the dealer button is set to a non-valid seat, a table balance can cause the application to lock-up."

**Why it happens:**

Developers implement the "move player from biggest table to smallest table" happy path, then discover edge cases through production tournaments. Poker TDA rules for balancing are complex and context-dependent. The interaction between dealer button position, blind positions, and move eligibility is not obvious.

**How to avoid:**

- Implement rebalancing as a pure function that takes complete tournament state and returns a list of moves — enables exhaustive unit testing with edge cases.
- Consult the Poker TDA Rules (2024 edition) as the authoritative reference for rebalancing procedures (Rules 25-28 cover table balancing and player movement).
- Test edge cases explicitly: single player remaining, two players at same table count, dealer button at last seat, player in small blind position being the move candidate.
- Expose a "dry run" rebalancing mode in the UI that shows proposed moves before executing — operators can catch bad suggestions before players are physically moved.
- Never auto-apply rebalancing; always require operator confirmation.

**Warning signs:**

- Players complaining about double-posting blinds after a move
- Tournament stuck unable to proceed after table break
- Rebalancing suggestion moving the dealer button holder

**Phase to address:** Phase 1 (seating engine) — core algorithm must be implemented and unit tested exhaustively before tournament testing.

---

### Pitfall 6: Multi-Tenant RLS Policy Leaking Venue Data at Application Layer

**What goes wrong:**

PostgreSQL Row Level Security on Core provides database-level tenant isolation, but RLS has a critical operational failure mode: if the application sets the `app.tenant_id` session variable (the common pattern for RLS) using a connection pool that reuses connections between requests, a previous request's tenant ID can bleed into the next request's session. This is a documented thread-safety issue — "some requests were being authorized with a previous request's user id because the user id for RLS was being stored in thread-local storage and threads were being reused for requests."

**Why it happens:**

RLS tutorials typically show setting the tenant variable inside a transaction, which is safe. But connection pools that don't reset session state between checkouts, or code that uses `SET` instead of `SET LOCAL`, cause session variables to persist across requests. In a multi-tenant venue management platform, this means venue A could see venue B's player data.

**How to avoid:**

- Always use `SET LOCAL app.tenant_id = '...'` (transaction-scoped), never `SET app.tenant_id` (session-scoped). Note that `SET` cannot take bind parameters — from application code, use `SELECT set_config('app.tenant_id', $1, true)`, whose third argument makes it transaction-scoped like `SET LOCAL`.
- Use a connection pool that explicitly resets session state on checkout (pgBouncer in transaction mode is safer than session mode for this pattern).
- Add an application-layer assertion: before every query, verify that `current_setting('app.tenant_id')` matches the expected tenant from the JWT/session — log and reject any mismatch as a security event.
- Write integration tests that explicitly test cross-tenant isolation: authenticate as venue A, attempt to query venue B's data through all API endpoints.
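
A sketch of the policy plus the transaction-scoped setting. Table and column names are illustrative; the real Core schema will differ:

```sql
-- Illustrative schema: players scoped to a venue (tenant).
ALTER TABLE players ENABLE ROW LEVEL SECURITY;
ALTER TABLE players FORCE ROW LEVEL SECURITY;  -- applies to the table owner too

CREATE POLICY tenant_isolation ON players
  USING (venue_id::text = current_setting('app.tenant_id'));

-- Per request, inside a transaction; resets automatically at COMMIT/ROLLBACK.
BEGIN;
-- third argument true = transaction-scoped, equivalent to SET LOCAL
SELECT set_config('app.tenant_id', '<tenant uuid from verified JWT>', true);
SELECT count(*) FROM players;  -- RLS filters to this venue only
COMMIT;
```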

**Warning signs:**

- Flaky test failures where tenant data appears for wrong user (intermittent = session state bleed)
- Query results contain more rows than expected for a given venue's player count
- Connection pool configured in session mode (pgBouncer `pool_mode = session`)

**Phase to address:** Phase 1 (Core backend data model) — RLS policies and the tenant isolation integration tests must be built before any multi-venue feature.

---

### Pitfall 7: Financial Calculations Using Floating-Point Arithmetic

**What goes wrong:**

Prize pool calculations, rake, bounties, and payout distributions calculated using `float64` accumulate representation errors. `0.1 + 0.2 != 0.3` in IEEE 754. A tournament with 127 players at €55 buy-in with 10% rake, split across 12 payout positions with percentage-based distribution, will produce cent-level errors that cascade: the sum of individual payouts does not equal the prize pool total, creating a money reconciliation error. In a live venue, this is a regulatory and reputational risk.

**Why it happens:**

Go's `float64` feels precise enough for small numbers. Developers write `prizePool := float64(buyIns) * 0.9` and don't notice that `0.9` is not exactly representable. The error is invisible until summing many such calculations.

**How to avoid:**

- Store and compute ALL financial values as `int64` representing the smallest currency unit (eurocents for EUR). €55 = 5500 cents.
- Never store or compute monetary values as `float64`. If you need percentage-based rake, multiply first then divide: `rake = (buyInCents * rakePercent) / 100` using integer division.
- For payout percentage splits (e.g., 35.7% to 1st place), compute each payout as `(prizePool * 357) / 1000` in integer arithmetic, then distribute any remainder (due to truncation) to the first payout position.
- Write a test that sums all individual payouts and asserts equality to the prize pool total — this test will catch floating-point drift immediately.

**Warning signs:**

- Prize pool "total" displayed in UI differs by €0.01-0.02 from sum of individual payouts
- Rake calculation produces fractional cents
- Any use of `float64` in the financial calculation code path

**Phase to address:** Phase 1 (financial engine) — this is a zero-compromise architectural constraint, not a later optimization.

---

## Moderate Pitfalls
### Pitfall 8: ARM64 CGO Cross-Compilation Blocking CI/CD

**What goes wrong:**

Go cross-compilation for `GOARCH=arm64` is trivial for pure Go code: set `GOOS=linux GOARCH=arm64` and the toolchain handles everything. The moment any dependency uses CGO (C bindings), this breaks. CGO requires a target-specific C toolchain (`aarch64-linux-gnu-gcc`). LibSQL's Go bindings may require CGO. If not caught early, the CI/CD pipeline that builds the Leaf node binary fails on x86-64 build agents, blocking all Leaf deployments.

**Prevention:**

- Audit all dependencies for CGO usage before finalizing the stack: `go list -deps -f '{{if .CgoFiles}}{{.ImportPath}}{{end}}' ./...` prints every package in the dependency graph that contains cgo files.
- For any CGO dependency, set up Docker Buildx with `--platform linux/arm64` multi-arch builds from the start.
- Alternatively, choose pure-Go alternatives where possible: `modernc.org/sqlite` (CGO-free SQLite driver) vs `mattn/go-sqlite3` (requires CGO). Validate LibSQL Go driver CGO requirements against the latest release.
- Use a dedicated ARM64 build runner (e.g., a Hetzner CAX11 ARM instance) as the canonical Leaf build environment rather than cross-compiling.

**Warning signs:**

- `cannot execute binary file: Exec format error` when deploying to Leaf
- CI build succeeds on x86 runner but produces the wrong binary
- Build scripts not setting `GOARCH` and `GOOS` explicitly

**Phase to address:** Phase 1 (build system setup) — establish the cross-compilation pipeline before writing any Leaf-specific code.

---

### Pitfall 9: WebSocket State Desync — Server Restarts vs. Client State

**What goes wrong:**

The Leaf backend restarts (deploy, crash, OOM). All WebSocket connections drop. When clients reconnect, they receive a "current state" snapshot. But if the snapshot is emitted before all pending database writes have completed (a race between the reconnect handler and write completion), clients receive a stale snapshot and operate on incorrect state. Operators may see a tournament clock that's 45 seconds behind reality, or chip counts from before the last bust-out.

**Why it happens:**

WebSocket reconnect handlers typically emit state immediately on connection establishment. If the server restart was triggered by a deploy, in-flight writes from the last moments before restart may not have been committed. The reconnect handler races against database recovery.

**How to avoid:**

- Implement sequence numbers on all state updates. Every WebSocket message carries a monotonic `seq` field. Clients detect gaps in the sequence and request a full resync rather than trusting a partial update.
- On server restart, wait for the LibSQL WAL to be fully checkpointed before accepting new WebSocket connections (add a health check gate).
- Implement idempotent state application on the client: applying the same state update twice produces the same result (prevents double-application of duplicate messages during the reconnect window).
- Only one goroutine writes to a WebSocket connection at a time (Go WebSocket constraint) — use a dedicated send goroutine with a buffered channel.

**Warning signs:**

- Tournament clock jumps backward after reconnect
- Chip counts inconsistent between two operators' screens after server restart
- Client receiving sequence numbers with gaps

**Phase to address:** Phase 1 (real-time sync architecture) — must be designed correctly from the first WebSocket implementation.

---

### Pitfall 10: SBC Hardware Reliability — Power Loss During Tournament

**What goes wrong:**

The Leaf node (Orange Pi 5 Plus) loses power mid-tournament. Even with NVMe (superior to SD card), an unclean shutdown can corrupt the filesystem if writes were in-flight. LibSQL's WAL may be partially written. The NATS JetStream store directory may be in an inconsistent state. On restoration, the system may fail to start, or worse, start with corrupted state that appears valid.

**Why it happens:**

SBC deployments in venues are not enterprise environments. Power strips get kicked. UPS systems are absent. NVMe is more reliable than SD but is not immune to corruption on power loss — the OS ext4 journal and SQLite WAL both need clean shutdown to guarantee consistency.

**How to avoid:**

- Mount the NVMe with a journaling filesystem configured for ordered data mode: `ext4` with `data=ordered` (the default on most distros, but verify).
- Run LibSQL with `PRAGMA synchronous = FULL` (or at minimum `NORMAL`) and `PRAGMA journal_mode = WAL` — already planned, but verify the synchronous mode specifically.
- Configure NATS `sync_interval: always` (Pitfall 1 above) to ensure JetStream state is on disk before any event is ACK'd.
- Implement a daily backup cron job that copies the LibSQL database to a USB drive or cloud storage (Hetzner Storage Box) — gives a recovery point even if local corruption is total.
- Add a systemd `ExecStop` hook that runs `PRAGMA wal_checkpoint(TRUNCATE)` before the process exits, minimizing WAL state at shutdown.

**Warning signs:**

- Venues skipping UPS/surge protector hardware
- Leaf node failing to start after power cycle during testing
- Filesystem errors in `dmesg` on startup after unclean shutdown

**Phase to address:** Phase 1 (Leaf node infrastructure setup) — backup strategy and sync configuration must be part of initial deployment scripts.

---

### Pitfall 11: GDPR Violation — Storing Player PII on Leaf Without Consent Mechanism

**What goes wrong:**

Player names, contact details, and tournament history are stored on the Leaf node for offline operation. If the venue is in the EU (the target market is Danish/European), this constitutes processing of personal data under GDPR. Without explicit consent capture at registration, documented data retention policies, and a right-to-erasure mechanism, the venue operator is liable. Fines can reach €20 million or 4% of global annual turnover. The platform architecture (players belong to Felt, not venues) amplifies risk — Felt is the data controller for the platform-level player profile.

**Why it happens:**

Tournament management software traditionally treats player data as operational, not personal. Developers focus on functionality first. GDPR compliance is deferred as "legal's problem." The offline-first architecture compounds this: data on edge devices is harder to audit and harder to delete on request.

**How to avoid:**

- Define the data model from day one with GDPR in mind: separate PII fields (name, email, phone) from operational data (chip count, tournament position). This allows selective erasure without corrupting tournament history.
- Implement a "right to erasure" API endpoint that anonymizes PII (replaces the name with "Player [ID]", nullifies contact fields) while preserving tournament result records for statistical purposes.
- The Leaf node must be encrypted at rest (LUKS — already planned). Verify LUKS is set up in the provisioning flow.
- Data retention: document a default retention policy (e.g., player PII deleted after 12 months of inactivity) and implement automated enforcement.
- Consent must be captured before storing player contact details — the player registration flow must include explicit consent.

**Warning signs:**

- Player registration form collecting email/phone without a consent checkbox
- No data deletion API endpoint in the player management module
- Player data stored without any anonymization strategy for inactive accounts

**Phase to address:** Phase 1 (player management) for anonymization model; Phase 3 (platform maturity) for full GDPR compliance workflow.

---

### Pitfall 12: Netbird Management Server as a Single Point of Failure

**What goes wrong:**

The Leaf node connects to the self-hosted Netbird management server to establish WireGuard peers. If the Netbird management server is down, new peer connections cannot be established. In practice, once WireGuard peers are established, they maintain connectivity without the management server. But initial Leaf node boot, new display node enrollment, and player PWA access (via reverse proxy) all require the management server. A Netbird server outage at the start of a tournament is a critical incident.

**Why it happens:**

The Netbird management server is treated as infrastructure rather than a critical dependency. It runs on a single LXC container on Proxmox. Single-container deployments have no redundancy. Developers assume "it's just networking" and don't plan for management plane failures.

**How to avoid:**

- Run the Netbird management server with Proxmox backup jobs (PBS daily backup) so restoration is fast if the container fails.
- Implement a startup procedure that verifies Netbird connectivity before marking the Leaf as ready for tournament use — surfaces infrastructure failures before they affect operations.
- Once WireGuard peers are established on the Leaf and display nodes, they retain connectivity through management server outages (WireGuard doesn't need a control plane after handshake). Document this so staff know not to panic if the management UI is unreachable mid-tournament.
- Consider running Netbird management on a separate, simpler VM rather than the same Proxmox host as other Core services — reduces correlated failure risk.

**Warning signs:**

- Netbird management server on the same LXC as other Core services (single failure domain)
- No monitoring/alerting on management server health
- New Leaf provisioning untested after a simulated management server outage

**Phase to address:** Phase 1 (infrastructure setup) — define the Netbird deployment topology before provisioning hardware.

---

### Pitfall 13: Player PWA Stale Service Worker Serving Old App Version

**What goes wrong:**

SvelteKit PWA service workers cache the application for offline use. When the operator deploys a new version of the player PWA, players who have the old version cached via service worker continue seeing the old app. If a new API contract is introduced (e.g., a new field in the tournament state WebSocket message), the old client silently ignores or mishandles it. In the worst case, an old client submits a rebuy request using the old API shape, which the new server rejects, and the player receives no feedback.

**Why it happens:**

Service worker update mechanics are subtle. The browser downloads the new service worker but doesn't activate it until all existing tabs running the old worker are closed. In a venue environment, players' phones keep the browser open throughout the tournament (for live clock viewing). The tab never closes, so the update never activates.

**How to avoid:**

- Configure the service worker to use `skipWaiting()` and `clients.claim()` to force immediate activation of new service worker versions — accept the tradeoff that this can disrupt in-flight requests.
- Implement a version header in all API responses. The client checks the server version on every WebSocket connection and forces a full page reload if the versions diverge.
- Use the `@vite-pwa/sveltekit` plugin for zero-config PWA setup — it handles cache busting and update notifications correctly.
- During development, test service worker update behavior explicitly: deploy a version, open the app, deploy again, verify the client updates without a manual browser restart.

**Warning signs:**

- Players reporting "the clock stopped updating" after a deploy
- API errors logged for requests with old field names/shapes after a schema change
- Service worker version in browser devtools differs from the current deploy

**Phase to address:** Phase 1 (player PWA setup) — configure service worker update behavior before the PWA goes live.

---

## Technical Debt Patterns

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|----------|-------------------|----------------|-----------------|
| Store prize pool as `float64` | Faster initial implementation | Cent-level rounding errors, reconciliation failures | Never |
| Skip WAL checkpoint configuration | Works fine in dev | WAL grows unbounded under tournament load, write stalls | Never |
| Copy NATS default config | Fast bootstrap | Data loss on power failure | Never for financial events |
| Hard-code venue ID (single-tenant) | Simplifies first version | Full schema migration to add multi-tenancy later | Only for pre-alpha validation |
| Use `SET` instead of `SET LOCAL` for RLS tenant | Slightly simpler code | Cross-tenant data leak in connection pool | Never |
| Skip Pi Zero 2W hardware testing | Faster UI iteration | Memory issues only discovered in production | Never — test on target hardware early |
| Auto-apply table rebalancing | Faster UX | Incorrect moves enforced without operator awareness | Never in a live tournament |
| Mock Netbird/WireGuard in dev | Faster development cycle | Networking issues only found at venue deployment | Acceptable in unit test phase; must integration-test before deploy |

---

## Integration Gotchas

| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| NATS embedded server | Copy quickstart config with default `sync_interval` | Set `sync_interval: always` in the embedded server options struct before starting |
| LibSQL cloud sync | Initiate sync inside an open write transaction | Gate sync behind a mutex; never overlap sync with a write transaction |
| LibSQL + Go | Use `mattn/go-sqlite3` CGO driver | Evaluate `modernc.org/sqlite` (CGO-free) or LibSQL's own Go driver — verify CGO requirement for ARM64 cross-compile |
| PostgreSQL RLS | Use `SET app.tenant_id` (session-scoped) | Use `SET LOCAL app.tenant_id` inside every transaction |
| Netbird reverse proxy | Route all player PWA traffic through management server | Use Netbird's peer-to-peer WireGuard path; the management server is only for the control plane |
| SvelteKit service worker | Use default Workbox cache-first strategy for API calls | Use network-first for API responses, cache-first only for static assets |
| Chromium kiosk on Pi Zero 2W | No memory limits, default flags | Set `--js-flags="--max-old-space-size=256"`, enable `zram`, use systemd `MemoryMax` |
| Go WebSocket | Multiple goroutines writing to the same connection | Single dedicated send goroutine per connection; other goroutines push to a channel |

---

## Performance Traps

| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| Unbounded WAL growth | Reads slow down as WAL grows; checkpoint stalls block writes | Manual checkpoint scheduling, `journal_size_limit` pragma | > 50 concurrent writes without checkpoint, or after a 2-hour tournament |
| Pi Zero 2W memory leak in display | Display goes blank mid-tournament | Explicit listener cleanup, memory profiling on target hardware | After 60-90 minutes of continuous display operation |
| NATS consumer storm on reconnect | All clients subscribe simultaneously after Leaf restart, overwhelming broker | Implement jittered reconnect backoff (50-500ms random) | > 20 concurrent display/player clients reconnecting simultaneously |
| Full RLS policy evaluation on every query | Slow queries as tournament grows | Index the `tenant_id` column; keep RLS policies simple (avoid JOINs in policy) | > 10,000 player records per venue |
| Broadcasting entire tournament state on every WebSocket message | Bandwidth spike on player PWA reconnect | Send deltas (only changed fields) after initial full-state sync | > 50 concurrent player PWA connections |

---

## Security Mistakes

| Mistake | Risk | Prevention |
|---------|------|------------|
| Storing player PII without encryption on Leaf | GDPR violation if device lost/stolen | LUKS full-disk encryption on NVMe (planned — verify it's in provisioning scripts) |
| Reusable Netbird enrollment key for all Leaf nodes | If key leaks, attacker can enroll rogue devices | Use one-time enrollment keys per Leaf node; rotate after provisioning |
| RLS bypass via direct database connection | Venue A reads venue B's data if DB credentials leak | Restrict DB user to application role only; no superuser credentials in app connection string |
| PIN authentication without rate limiting | Brute-force PIN in offline mode | Implement exponential backoff after 5 failed PIN attempts, lockout after 10 |
| Serving player PWA over HTTP (non-HTTPS) | Service workers require HTTPS; also exposes player data | All player-facing endpoints must terminate TLS (Netbird reverse proxy with Let's Encrypt) |

---

## UX Pitfalls

| Pitfall | User Impact | Better Approach |
|---------|-------------|-----------------|
| Auto-applying table rebalancing moves without confirmation | Operators don't know why players are being moved; incorrect moves go unchallenged | Always show proposed moves, require tap-to-confirm before executing |
| Tournament clock not visible on dark screens in a dim poker room | Operators squint, miss blind level changes | Dark-room-first design from day one: minimum 18pt font for clock, high contrast ratios > 7:1, Catppuccin Mocha base |
| Player PWA showing stale chip counts after reconnect | Players see incorrect stack sizes, distrust the platform | Show "last updated X seconds ago" indicator; force full resync on reconnect |
| Sound events playing at maximum volume | Dealers and players startled; venue disruption | Per-venue configurable volume, default to 50%, fade-in for alerts |
| Prize payout screen not showing running total vs. paid out total | Operator makes payout errors when managing multiple players simultaneously | Show real-time "remaining to pay out" counter on payout screen |

---

## "Looks Done But Isn't" Checklist

- [ ] **Tournament clock:** Verify pause/resume correctly adjusts all time-based triggers (blind level end, break start) — not just the display counter
- [ ] **Prize pool:** Verify sum of all individual payouts equals prize pool total (run automated reconciliation test)
- [ ] **Table rebalancing:** Verify algorithm handles all TDA edge cases: last 2-player table, dealer button seat, player in blind position
- [ ] **Offline mode:** Verify full tournament can run (including rebuys, bust-outs, level changes) with internet completely disconnected for 4+ hours
- [ ] **Display node restart:** Verify display node automatically rejoins and resumes correct view after reboot without operator intervention
- [ ] **NATS replay:** Verify all queued events replay correctly after Leaf comes back online after 8+ hours offline
- [ ] **RLS isolation:** Verify API endpoints return 0 results (not 403) for valid venue A token querying venue B data — 403 leaks resource existence
- [ ] **GDPR erasure:** Verify player PII deletion does not delete tournament result records (anonymize, don't delete)
- [ ] **NATS fsync:** Verify `sync_interval: always` is in the deployed Leaf configuration (not just development)
- [ ] **Pi Zero memory:** Verify display node shows no memory growth after 4-hour continuous tournament with `/usr/bin/free -m` monitoring
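
The prize-pool item above can be enforced with a reconciliation check in integer cents, which is also the shape of the CI gate mentioned later in this document. `ReconcilePayouts` is a hypothetical name for such a check:

```go
package main

import "fmt"

// ReconcilePayouts verifies the invariant from the checklist: the sum
// of individual payouts (in integer cents, never floats) must equal
// the prize pool exactly. Run it as a CI gate and again before the
// payout screen is shown.
func ReconcilePayouts(prizePoolCents int64, payoutsCents []int64) error {
	var sum int64
	for _, p := range payoutsCents {
		sum += p
	}
	if sum != prizePoolCents {
		return fmt.Errorf("payout mismatch: pool=%d paid=%d diff=%d",
			prizePoolCents, sum, sum-prizePoolCents)
	}
	return nil
}

func main() {
	// A 1000.00 pool split 50/30/20. Any remainder cents from odd
	// splits must be assigned deterministically, never rounded away.
	pool := int64(100000)
	payouts := []int64{50000, 30000, 20000}
	if err := ReconcilePayouts(pool, payouts); err != nil {
		panic(err)
	}
	fmt.Println("reconciled")
}
```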

---

## Recovery Strategies

| Pitfall | Recovery Cost | Recovery Steps |
|---------|---------------|----------------|
| NATS data loss (wrong fsync default in production) | HIGH | Restore from last backup; replay any manually recorded tournament events; accept some data loss |
| SQLite corruption after power loss | MEDIUM | Run `PRAGMA integrity_check` to assess damage (it verifies, it does not repair); SQLite replays an intact WAL automatically on the next open; if the check fails, restore from daily backup |
| Prize pool floating-point error discovered post-tournament | HIGH | Manual audit of all transactions; correction requires agreement of all players involved |
| RLS cross-tenant data leak | HIGH | Immediate incident response; audit logs for all affected queries; notify affected venues per GDPR breach requirements (72-hour deadline) |
| Pi Zero display failure mid-tournament | LOW | Display nodes are stateless — reboot restores operation within 60 seconds; have spare Pi Zero on-site |
| Table rebalancing error (player moved incorrectly) | MEDIUM | Manual seat correction via operator UI; document as a tournament irregularity in audit log |
| Service worker serving stale PWA | LOW | Force browser refresh (player gesture); if critical, add server-side cache-busting header |

---

## Pitfall-to-Phase Mapping

| Pitfall | Prevention Phase | Verification |
|---------|------------------|--------------|
| NATS default fsync data loss | Phase 1 (NATS setup) | Integration test: write 100 events, power-cycle Leaf, verify all 100 replay correctly |
| LibSQL WAL checkpoint stall | Phase 1 (database initialization) | Load test: 500 writes/min for 2 hours, monitor WAL file size stays bounded |
| Offline sync financial conflict | Phase 1 (financial engine architecture) | Conflict test: process rebuy offline, simulate late online event with same sequence, verify ledger correctness |
| Pi Zero memory exhaustion | Phase 1 (display node MVP) | Soak test: run display on actual Pi Zero 2W hardware for 4 hours, monitor RSS |
| Table rebalancing algorithm | Phase 1 (seating engine) | Unit tests covering all TDA edge cases (min 20 cases); load test with simulated 40-table tournament |
| Multi-tenant RLS data leak | Phase 1 (Core backend) | Security test: verify all 30+ API endpoints return correct tenant-scoped data only |
| Float arithmetic in financials | Phase 1 (financial engine) | Automated test: sum of payouts must equal prize pool (run as CI gate) |
| ARM64 CGO cross-compile | Phase 1 (build system) | CI gate: ARM64 binary builds successfully and passes smoke test on Orange Pi 5 Plus |
| WebSocket state desync | Phase 1 (real-time sync) | Chaos test: restart Leaf server mid-tournament, verify all clients resync within 5 seconds |
| SBC power loss data corruption | Phase 1 (infrastructure) | Chaos test: hard-power-cycle Leaf mid-tournament 10 times, verify restart always recovers cleanly |
| GDPR compliance | Phase 1 (player management) + Phase 3 | Verify: right-to-erasure API anonymizes PII, preserves results; audit trail shows all PII access |
| Netbird management SPOF | Phase 1 (infrastructure design) | Test: take Netbird management offline, verify existing WireGuard peers retain connectivity |
| PWA stale service worker | Phase 1 (PWA setup) | Test: deploy v1, open app, deploy v2, verify client shows v2 without manual browser restart |

---

## Sources

- [NATS JetStream Anti-Patterns for Scale — Synadia](https://www.synadia.com/blog/jetstream-design-patterns-for-scale) (MEDIUM confidence — official vendor blog)
- [Jepsen: NATS 2.12.1 — jepsen.io](https://jepsen.io/blog/2025-12-08-nats-2.12.1) (HIGH confidence — independent analysis, Dec 2025)
- [NATS JetStream loses acknowledged writes by default — GitHub Issue #7564](https://github.com/nats-io/nats-server/issues/7564) (HIGH confidence — official tracker)
- [Downsides of Local First / Offline First — RxDB](https://rxdb.info/downsides-of-offline-first.html) (MEDIUM confidence — library author perspective)
- [SQLite Write-Ahead Logging — sqlite.org](https://sqlite.org/wal.html) (HIGH confidence — official documentation)
- [LibSQL Embedded Replicas Data Corruption — GitHub Discussion #1910](https://github.com/tursodatabase/libsql/discussions/1910) (HIGH confidence — official tracker)
- [Turso Offline Sync Public Beta](https://turso.tech/blog/turso-offline-sync-public-beta) (MEDIUM confidence — vendor announcement)
- [PostgreSQL RLS Implementation Guide — permit.io](https://www.permit.io/blog/postgres-rls-implementation-guide) (MEDIUM confidence — verified against AWS prescriptive guidance)
- [Multi-tenant Data Isolation with PostgreSQL RLS — AWS](https://aws.amazon.com/blogs/database/multi-tenant-data-isolation-with-postgresql-row-level-security/) (HIGH confidence — official AWS documentation)
- [Floats Don't Work for Storing Cents — Modern Treasury](https://www.moderntreasury.com/journal/floats-dont-work-for-storing-cents) (HIGH confidence — multiple corroborating sources)
- [SQLite WAL Checkpoint Starvation — sqlite-users](https://sqlite-users.sqlite.narkive.com/muT0rMYt/sqlite-wal-checkpoint-starved) (MEDIUM confidence — community discussion)
- [Chromium on Pi Zero 2W memory constraints — Raspberry Pi Forums](https://forums.raspberrypi.com/viewtopic.php?t=326222) (MEDIUM confidence — community-verified)
- [NetBird 2025 Guide: 5 Critical Mistakes](https://junkangworld.com/blog/your-2025-netbird-guide-5-critical-mistakes-to-avoid) (LOW confidence — third-party blog, verify against official docs)
- [SvelteKit Service Workers — official docs](https://kit.svelte.dev/docs/service-workers) (HIGH confidence — official documentation)
- [The Tournament Director — known bugs changelog](https://thetournamentdirector.net/changes.txt) (HIGH confidence — official changelog)
- [Poker TDA Rules 2013 — table balancing procedures](https://www.pokertda.com/wp-content/uploads/2013/08/Poker_TDA_Rules_2013_Version_1.1_Final_handout_PDF_redlines_from_2011_Rules.pdf) (MEDIUM confidence — check against current TDA ruleset)
- [GDPR compliance for gaming/gambling operators — GDPR Local](https://gdprlocal.com/gdpr-compliance-online-casinos-betting-operators/) (MEDIUM confidence — legal advisory blog, not authoritative)

---

*Pitfalls research for: edge-cloud poker venue management platform (Felt)*

*Researched: 2026-02-28*