diff --git a/docs/AUTH_RESEARCH.md b/docs/AUTH_RESEARCH.md new file mode 100644 index 0000000..b48ed8e --- /dev/null +++ b/docs/AUTH_RESEARCH.md @@ -0,0 +1,805 @@ +# PVM Authentication Framework Research + +> **Date:** 2025-02-08 +> **Status:** Final +> **Author:** Research Agent (Claude) + +--- + +## 1. Executive Summary + +**Recommendation: Zitadel (self-hosted) + lightweight JWT validation on local nodes.** + +After evaluating 11 authentication frameworks against PVM's unique split-brain architecture requirements, Zitadel emerges as the clear winner for these reasons: + +1. **Official Rust/Axum crate** (`zitadel` on crates.io) with dedicated Axum middleware, introspection, and OIDC modules -- no other auth platform has this level of first-class Rust support. +2. **Official SvelteKit integration** via Auth.js with documented PKCE flow, maintained by the Zitadel team. +3. **Self-hosted on PostgreSQL** (v3+ requires PostgreSQL, dropping CockroachDB) -- PVM already uses PostgreSQL 16+, so Zitadel shares the same database engine with zero additional database infrastructure. +4. **Standard OIDC/OAuth2 with JWKS endpoint** -- the RPi5 local nodes cache the JWKS public keys and validate JWTs entirely offline. No auth server needed on the Pi. +5. **AGPL v3 license** -- fine for PVM since we use Zitadel as-is (not modifying its source code), and it runs as an independent service. +6. **Resource-efficient** -- runs on 512MB RAM + 1 CPU for test environments, 1-2GB RAM + 2-4 CPUs for production. Fits comfortably on Hetzner PVE. +7. **Full feature coverage** -- social login (Google, Apple, Facebook), email+password, phone+password, TOTP/MFA, passkeys, magic links, RBAC, admin console, audit logs. +8. **Free forever when self-hosted** -- no MAU limits, no feature gating on the self-hosted version. + +**Runner-up: Ory (Kratos + Hydra)** -- more flexible but significantly more complex to operate (two services, custom UI required, manual integration between components). 
+ +**Third place: Keycloak** -- battle-tested but Java-based and resource-heavy (about 1.25GB RAM baseline, more than Zitadel needs for equivalent workloads), with no Rust SDK. + +--- + +## 2. The Split-Brain Auth Challenge + +### The Problem + +PVM has a distributed architecture where a player's phone can talk to either: +- **The cloud** (Hetzner PVE) -- the primary SaaS backend +- **A local RPi5 node** at a poker venue -- for low latency and offline resilience + +The local node may be offline for up to 72 hours. When online, it syncs via NATS JetStream. This creates a fundamental auth challenge: + +``` +Player Phone + | + |-- (mDNS discovery) --> RPi5 Local Node (may be offline) + | + |-- (internet) -------> Cloud SaaS (Hetzner PVE) +``` + +**Auth tokens issued by the cloud must be valid on the local node, and vice versa**, without the local node calling home to verify them. + +### The Solution Pattern + +The only viable approach for offline token validation is **asymmetric JWT signing with cached JWKS**: + +1. **Zitadel runs on the cloud** (Hetzner PVE), issuing JWTs signed with RS256 (RSA) or ES256 (ECDSA) private keys. +2. **The JWKS (public keys) are published** at the `jwks_uri` listed in Zitadel's OIDC discovery document (`/.well-known/openid-configuration`); for Zitadel this is `/oauth/v2/keys`. +3. **Each RPi5 node caches the JWKS** when it syncs with the cloud. The cache is refreshed on every NATS sync cycle. +4. **When offline, the RPi5 validates JWTs** using only the cached public keys -- pure cryptographic verification, no network calls. +5. **Token refresh** happens against whichever endpoint is reachable (cloud or local). The local node can issue short-lived tokens that are also verifiable by the cloud (using the same or a federated key trust). 
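Steps 3 to 5 of the pattern above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: `CachedJwks`, `key_for`, `refresh_action`, and both error/action enums are hypothetical names, key material is treated as opaque bytes, and no signature verification is performed here. The 72-hour window and the 7-day refresh-token lifetime are taken from this document's design figures.

```rust
use std::collections::HashMap;

/// Longest period a node is expected to run without syncing (72 hours).
const MAX_OFFLINE_SECS: u64 = 72 * 60 * 60;
/// Refresh-token lifetime assumed from the design table (7 days).
const REFRESH_TTL_SECS: u64 = 7 * 24 * 60 * 60;

/// Hypothetical cached key set: key ID (`kid`) -> encoded public key bytes.
struct CachedJwks {
    keys: HashMap<String, Vec<u8>>,
    /// Unix time (seconds) of the last successful sync with the cloud.
    synced_at: u64,
}

#[derive(Debug, PartialEq)]
enum KeyLookupError {
    /// The cache is older than the supported offline window.
    CacheExpired,
    /// No cached key matches the token's `kid` header.
    UnknownKid,
}

impl CachedJwks {
    /// Step 4: find the public key for a token without any network call.
    fn key_for(&self, kid: &str, now: u64) -> Result<&[u8], KeyLookupError> {
        if now.saturating_sub(self.synced_at) > MAX_OFFLINE_SECS {
            return Err(KeyLookupError::CacheExpired);
        }
        self.keys
            .get(kid)
            .map(|key| key.as_slice())
            .ok_or(KeyLookupError::UnknownKid)
    }
}

#[derive(Debug, PartialEq)]
enum RefreshAction {
    /// Standard OIDC refresh against Zitadel.
    RefreshAtCloud,
    /// The node issues a short-lived token signed with its own key.
    IssueLocalToken,
    /// The user has to log in again once the cloud is reachable.
    Reauthenticate,
}

/// Step 5: decide where a token refresh should happen.
/// When the cloud is reachable, Zitadel itself decides whether the
/// refresh token is still acceptable, so its age is not checked here.
fn refresh_action(cloud_reachable: bool, refresh_token_age_secs: u64) -> RefreshAction {
    if cloud_reachable {
        RefreshAction::RefreshAtCloud
    } else if refresh_token_age_secs < REFRESH_TTL_SECS {
        RefreshAction::IssueLocalToken
    } else {
        RefreshAction::Reauthenticate
    }
}
```

A caller would read the `kid` from the JWT header, pass it to `key_for` together with the current Unix time, and only then hand the returned key to a JWT library for signature and claim checks. Taking the clock as a parameter keeps the logic testable on a device whose clock may drift while offline.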
+ +### Key Design Decisions + +| Decision | Choice | Rationale | +|---|---|---| +| Signing algorithm | RS256 or ES256 | Asymmetric -- public key can be distributed freely | +| Token format | JWT (access token) + opaque refresh token | JWTs are self-contained and verifiable offline | +| JWKS caching | On RPi5 via NATS sync | Ensures offline validation even after 72h | +| Token lifetime | Access: 15min, Refresh: 7 days | Short access tokens limit blast radius; refresh tokens cover offline periods | +| Auth server location | Cloud only | RPi5 does JWT validation only, not token issuance | +| Social login | Cloud only (OAuth requires internet) | Cloud issues PVM JWT after social auth completes | + +--- + +## 3. Evaluation Matrix + +### Scoring Key +- **A** = Excellent fit +- **B** = Good fit with minor gaps +- **C** = Usable but significant caveats +- **D** = Poor fit +- **F** = Does not work + +| Framework | Rust SDK | SvelteKit | Self-Hosted | Free Tier | Social Login | MFA/2FA | JWT/JWKS | Resource Needs | Split-Brain Fit | Overall | +|---|---|---|---|---|---|---|---|---|---|---| +| **Zitadel** | A (official crate) | A (Auth.js example) | A (Docker+PG) | A (unlimited self-hosted) | A (Google/Apple/FB+) | A (TOTP, passkeys) | A (standard OIDC JWKS) | B (512MB-2GB) | A | **A** | +| **Ory (Kratos+Hydra)** | B (auto-gen SDK) | B (community kit) | A (Go binaries) | A (fully OSS) | A (via Kratos) | A (TOTP, WebAuthn) | A (Hydra JWKS) | A (lightweight Go) | A | **B+** | +| **Keycloak** | D (no SDK, REST API) | B (OIDC generic) | A (Docker) | A (fully OSS) | A (built-in) | A (TOTP, WebAuthn) | A (JWKS) | C (1.25GB+ RAM, Java) | A | **B** | +| **Logto** | D (no Rust SDK) | B (OIDC generic) | A (Docker, Node.js) | A (unlimited self-hosted) | A (20+ providers) | A (TOTP, passkeys) | A (OIDC JWKS) | B (~512MB-1GB) | A | **B** | +| **Authentik** | D (no SDK, REST/OIDC) | B (OIDC generic) | A (Docker) | A (fully OSS) | A (broad) | A (TOTP, WebAuthn) | A (OIDC JWKS) | C (2GB+ RAM, 
Python) | A | **B-** | +| **Auth0** | D (no SDK) | B (Auth.js) | F (cloud only) | B (25k MAU free) | A (built-in) | C (paid only) | A (JWKS) | N/A (managed) | B (vendor dep.) | **C+** | +| **Clerk** | C (community crate) | B (community svelte-clerk) | F (cloud only) | B (10k MAU free) | A (built-in) | A (built-in) | B (session tokens) | N/A (managed) | C (cloud-dependent) | **C** | +| **Supabase Auth** | D (no Rust SDK) | C (JS client) | B (GoTrue Docker) | B (50k MAU cloud) | A (built-in) | B (limited) | B (RS256 JWTs) | B (~512MB) | B (GoTrue only) | **C** | +| **SuperTokens** | D (no Rust SDK) | C (React SDK focus) | A (Docker core) | A (unlimited self-hosted) | A (built-in) | A (TOTP) | C (session-based, not JWT) | B (~1GB) | C (session model) | **C** | +| **Hanko** | D (no Rust SDK) | B (web components) | A (Docker, Go) | B (10k MAU cloud) | B (limited providers) | A (passkeys native) | B (OIDC) | A (lightweight Go) | B | **C+** | +| **Custom (Rust)** | A (own code) | A (own design) | A (embedded) | A (free) | C (build OAuth flows) | C (build TOTP) | A (jsonwebtoken crate) | A (no overhead) | A | **B-** | + +--- + +## 4. Deep Dive: Top 3 Candidates + +### 4.1 Zitadel (Recommended) + +**What it is:** A cloud-native identity management platform written in Go, providing full OIDC/OAuth2, SAML, and LDAP support with a built-in admin console. 
+ +**Why it wins for PVM:** + +**Rust Integration (Best-in-Class)** +The `zitadel` crate (v5.5+) provides: +- `zitadel::axum` module with middleware for token introspection +- `zitadel::oidc` for OpenID Connect discovery and token validation +- `introspection_cache` feature flag for caching OIDC discovery and introspection results +- Feature flags: `axum`, `oidc`, `credentials`, `api`, `introspection_cache` + +```toml +# Cargo.toml +[dependencies] +zitadel = { version = "5", features = ["axum", "oidc", "introspection_cache"] } +``` + +**SvelteKit Integration (Official)** +Zitadel maintains an [official example](https://github.com/zitadel/example-auth-sveltekit) using `@auth/sveltekit` with: +- PKCE authorization code flow +- Automatic token refresh +- Server-side session management via SvelteKit load functions +- Federated logout with CSRF protection + +**Self-Hosting (Simple)** +- Single Go binary + PostgreSQL (PVM already has PG 16+) +- Docker Compose deployment in minutes +- v3+ requires PostgreSQL only (dropped CockroachDB) +- Resource needs: 512MB RAM (test), 1-2GB RAM + 2-4 CPUs (production) + +**Feature Completeness:** +- Social login: Google, Apple, Facebook, GitHub, GitLab, Microsoft, and more +- Email + password with customizable policies +- Phone number authentication +- TOTP, passkeys/FIDO2, email/SMS OTP +- Magic links / passwordless +- Built-in admin console (web UI) +- Multi-tenancy with organizations +- RBAC with roles and permissions +- Unlimited audit trail +- Branding and custom login pages +- Account linking across providers + +**Licensing:** +- AGPL v3 as of Zitadel v3 (March 2025) +- Using Zitadel as an identity service without modifying its source code is fine for commercial use +- SDKs and Protocol Buffer definitions remain Apache 2.0 +- A commercial license is available if AGPL is incompatible + +**Limitations:** +- AGPL may concern some organizations (but not PVM's use case) +- The Rust crate's introspection module requires network access 
to Zitadel for token introspection (but we use JWKS validation instead on the RPi5, which is offline-capable) +- Resource usage spikes during password hashing (4 CPU cores recommended for production) + +--- + +### 4.2 Ory (Kratos + Hydra) -- Runner-Up + +**What it is:** A suite of Go microservices -- Kratos for identity management, Hydra for OAuth2/OIDC, Keto for permissions, Oathkeeper for API gateway auth. + +**Why it's strong:** +- Written in Go, lightweight binaries (5-15MB each), low resource usage +- Kratos handles registration, login, MFA, social login, account recovery +- Hydra is OpenID Certified and handles OAuth2 + JWKS endpoint +- Auto-generated Rust SDK for both Kratos and Hydra APIs +- Fully open source (Apache 2.0 license) +- Can scale to billions of users (used by OpenAI per their claims) + +**Why it loses to Zitadel for PVM:** +- **Operational complexity:** You need to run Kratos AND Hydra as separate services, configure them to work together, and build a custom login/consent UI. This is significant engineering overhead. +- **No built-in admin UI:** You must build or find a third-party admin interface. +- **SvelteKit integration:** Only community examples exist (ory-kit by MarkusThielker), and development on the SvelteKit UI has stopped. +- **Rust SDK is auto-generated:** Works but lacks the ergonomics and Axum-specific middleware of Zitadel's crate. +- **Documentation complexity:** Setting up Kratos + Hydra together requires deep understanding of OAuth2 flows and significant configuration. + +**Resource requirements:** Very lightweight. Kratos idles at ~15-380MB depending on configuration. Hydra is similarly lean. Total for both: 256MB-1GB RAM. + +**Best for:** Teams that want maximum flexibility and are willing to invest in custom UI development and operational complexity. + +--- + +### 4.3 Keycloak -- Third Place + +**What it is:** The industry-standard open-source identity management platform, backed by Red Hat/JBoss, written in Java. 
+ +**Why it's considered:** +- Most battle-tested solution in the market (used by thousands of enterprises) +- Full OIDC/OAuth2/SAML support with standard JWKS endpoints +- Built-in admin console, user management, social login, MFA +- Extensive documentation and community +- FAPI 2.0 compliant (Keycloak 26.4+) +- JWT Authorization Grant (RFC 7523) in Keycloak 26.5 + +**Why it loses for PVM:** +- **Java-based, resource-heavy:** Minimum 750MB RAM for a bare container, recommended 2GB for production. PVM's Hetzner PVE resources are better spent elsewhere. +- **No Rust SDK:** You'd use generic OIDC/JWT validation crates. The REST admin API works but has no Rust client. +- **Slower startup:** Java cold starts are measured in seconds, not milliseconds. +- **Overkill for PVM:** Enterprise features like SAML, LDAP federation, and Kerberos add complexity without value for a poker venue SaaS. +- **Theme customization:** Uses FreeMarker templates, which have a steep learning curve. + +**Resource requirements:** 1.25GB RAM base (including caches), recommended 2GB+ for production. 1-2 CPU cores minimum. + +**Best for:** Enterprises with existing Java infrastructure that need SAML/LDAP federation. + +--- + +## 5. 
Recommended Architecture + +### Overview + +``` + CLOUD (Hetzner PVE) + ┌──────────────────────────────────────────┐ + │ │ + │ ┌─────────┐ ┌──────────────────┐ │ + │ │ Zitadel │ │ PVM Cloud API │ │ + │ │ (Auth) │◄───►│ (Rust/Axum) │ │ + │ │ │ │ │ │ + │ │ PG DB │ │ PG DB │ │ + │ └────┬────┘ └────────┬─────────┘ │ + │ │ │ │ + │ │ JWKS endpoint │ NATS │ + │ │ /oauth/v2/keys │ JetStream │ + │ │ (jwks_uri) │ │ + └───────┼───────────────────┼──────────────┘ + │ │ + │ │ + ════════╪═══════════════════╪═══════ INTERNET + │ │ + ▼ ▼ + ┌──────────────────────────────────────────┐ + │ RPi5 LOCAL NODE │ + │ │ + │ ┌──────────────┐ ┌─────────────────┐ │ + │ │ Cached JWKS │ │ PVM Local API │ │ + │ │ (public keys)│◄─│ (Rust binary) │ │ + │ │ │ │ │ │ + │ │ Updated via │ │ libSQL DB │ │ + │ │ NATS sync │ │ │ │ + │ └──────────────┘ │ NATS leaf node │ │ + │ └─────────────────┘ │ + └──────────────────────────────────────────┘ + ▲ + │ mDNS discovery + local API calls + │ + ┌───────┴──────┐ + │ Player Phone │ + │ (SvelteKit) │ + └──────────────┘ +``` + +### Auth Flow: Registration & Login (Cloud) + +``` +1. Player opens PVM app (SvelteKit) +2. App detects network connectivity --> routes to cloud +3. Player chooses: email+password, phone+password, or social login (Google/Apple/Facebook) +4. SvelteKit redirects to Zitadel login page (OIDC Authorization Code + PKCE) +5. Zitadel handles the auth flow (including social OAuth if applicable) +6. Zitadel issues: + - Access token (JWT, signed RS256, 15min expiry) + - Refresh token (opaque, 7-day expiry) + - ID token (JWT with user claims) +7. SvelteKit stores tokens (httpOnly cookies for SSR, secure storage for SPA) +8. Cloud API validates JWT on each request using Zitadel's JWKS +``` + +### Auth Flow: Local Node (Offline-Capable) + +``` +1. Player phone discovers RPi5 via mDNS +2. Phone sends request to local API with existing JWT (from cloud login) +3. RPi5 Rust binary validates JWT: + a. Parse JWT header to get key ID (kid) + b. 
Look up public key in cached JWKS (stored in libSQL or memory) + c. Verify RS256 signature + d. Validate claims (exp, iss, aud, sub) +4. If JWT is expired but refresh token is available: + a. If cloud is reachable: refresh against Zitadel + b. If offline: issue a short-lived local token (signed with the node's key) + - The cloud trusts the node's public key (registered during provisioning) +5. Request is authenticated; proceed with venue operations +``` + +### Auth Flow: Token Refresh Strategy + +``` +Token Refresh Decision Tree: +│ +├── Cloud reachable? +│ ├── YES: Refresh against Zitadel (standard OIDC refresh) +│ │ └── New access token (15min) + new refresh token (7 days) +│ │ +│ └── NO: Is the refresh token still valid (< 7 days)? +│ ├── YES: Local node issues a "bridge token" +│ │ - Signed with the node's private key +│ │ - Short-lived (30 min) +│ │ - Contains original user claims from the expired JWT +│ │ - Marked with a "local_issued" claim +│ │ +│ └── NO: User must re-authenticate when cloud is reachable +│ (graceful degradation -- show "offline mode limited") +``` + +### JWKS Sync Strategy + +``` +1. On RPi5 boot / NATS reconnect: + - Fetch JWKS from Zitadel's jwks_uri (/oauth/v2/keys, as listed in the OIDC discovery document) + - Store in libSQL (jwks_cache table) with timestamp + - Also cache in memory (HashMap) + +2. Periodic refresh (every 1 hour while connected): + - Re-fetch JWKS + - Compare with cached version + - Update if changed (key rotation support) + +3. Via NATS JetStream: + - Cloud publishes "jwks.updated" event on key rotation + - RPi5 subscribes and refreshes immediately + +4. Offline fallback: + - Use last cached JWKS (stored in libSQL) + - Valid for up to 72 hours (matches offline window) + - Include 2-3 previous key versions to handle rotation during offline period +``` + +### Node Key Trust Model + +``` +1. 
RPi5 provisioning: + - Node generates its own RSA key pair (used for RS256 signing) on first boot + - Public key is registered with the cloud PVM API via NATS + - Cloud stores node public keys in its database + +2. Local token issuance (offline refresh): + - Node signs "bridge tokens" with its private key + - Token includes: original user sub, node_id, "local_issued" flag + - When cloud comes back online, it can verify these tokens + using the registered node public key + +3. Cloud verification of local tokens: + - Check node_id claim + - Look up node's public key + - Verify signature + - Apply stricter authorization (local tokens get fewer permissions) +``` + +--- + +## 6. Implementation Considerations + +### 6.1 Rust/Axum Backend (Cloud) + +**Dependencies:** + +```toml +[dependencies] +# Zitadel integration (cloud API) +zitadel = { version = "5", features = ["axum", "oidc", "introspection_cache"] } + +# For the RPi5 local node (standalone JWT validation) +jsonwebtoken = "9" # JWT creation and validation +axum-jwt-auth = "0.4" # Axum middleware for JWT with JWKS + +# Supporting crates +serde = { version = "1", features = ["derive"] } +serde_json = "1" +reqwest = { version = "0.12", features = ["json"] } # For JWKS fetching +``` + +**Cloud API: Token validation with Zitadel crate** + +```rust +use zitadel::axum::introspection::{IntrospectedUser, IntrospectionStateBuilder}; +use axum::{Router, routing::get, extract::State}; + +// Option A: Use Zitadel's introspection (requires Zitadel to be reachable) +async fn protected_handler(user: IntrospectedUser) -> String { + // Debug formatting is used because `username` may be optional in the crate + format!("Hello, {:?}!", user.username) +} + +// Option B: Use standalone JWKS validation (works offline too) +// This is what the RPi5 uses, but the cloud can use it as well +// `Claims`, `CachedJwks`, and `AuthError` are application-defined types (not shown) +use jsonwebtoken::{decode, DecodingKey, Validation, Algorithm}; + +fn validate_jwt(token: &str, jwks: &CachedJwks) -> Result<Claims, AuthError> { + let header = jsonwebtoken::decode_header(token)?; + let kid = header.kid.ok_or(AuthError::MissingKid)?; + let key = 
jwks.get_key(&kid).ok_or(AuthError::UnknownKey)?; + let validation = Validation::new(Algorithm::RS256); + let token_data = decode::<Claims>(token, key, &validation)?; + Ok(token_data.claims) +} +``` + +**RPi5 Local Node: Offline JWT validation** + +```rust +use std::collections::HashMap; +use chrono::{DateTime, Utc}; // assumed: chrono for timestamps +use jsonwebtoken::{decode, DecodingKey, Validation, Algorithm, jwk::JwkSet}; + +// `Result` is assumed to be an alias such as `anyhow::Result`; +// `Claims`, `AuthError`, and `Database` are application-defined (not shown). +struct JwksCache { + keys: HashMap<String, DecodingKey>, + last_updated: DateTime<Utc>, +} + +impl JwksCache { + /// Load JWKS from libSQL on startup + async fn from_libsql(db: &Database) -> Result<Self> { + let row = db.query("SELECT jwks_json, updated_at FROM jwks_cache ORDER BY updated_at DESC LIMIT 1").await?; + let jwks: JwkSet = serde_json::from_str(&row.jwks_json)?; + let keys = jwks.keys.iter() + .filter_map(|jwk| { + let kid = jwk.common.key_id.as_ref()?; + let key = DecodingKey::from_jwk(jwk).ok()?; + Some((kid.clone(), key)) + }) + .collect(); + Ok(Self { keys, last_updated: row.updated_at }) + } + + /// Refresh from Zitadel (when online) + async fn refresh(&mut self, zitadel_url: &str) -> Result<()> { + // Zitadel's jwks_uri; confirm against /.well-known/openid-configuration + let jwks_url = format!("{}/oauth/v2/keys", zitadel_url); + let jwks: JwkSet = reqwest::get(&jwks_url).await?.json().await?; + // Store in libSQL for offline use + self.store_in_libsql(&jwks).await?; + // Update in-memory cache + self.update_keys(&jwks); + Ok(()) + } + + fn validate(&self, token: &str) -> Result<Claims, AuthError> { + let header = jsonwebtoken::decode_header(token)?; + let kid = header.kid.as_ref().ok_or(AuthError::MissingKid)?; + let key = self.keys.get(kid).ok_or(AuthError::UnknownKey)?; + + let mut validation = Validation::new(Algorithm::RS256); + validation.set_issuer(&["https://auth.pvm.example.com"]); + validation.set_audience(&["pvm-api"]); + + let data = decode::<Claims>(token, key, &validation)?; + Ok(data.claims) + } +} +``` + +### 6.2 SvelteKit Frontend + +**Dependencies:** + +```bash +npm install @auth/sveltekit @auth/core +``` + +**Auth.js configuration with Zitadel:** + +```typescript +// src/auth.ts +import { SvelteKitAuth } from "@auth/sveltekit"; 
+import Zitadel from "@auth/core/providers/zitadel"; +import { env } from "$env/dynamic/private"; + +export const { handle, signIn, signOut } = SvelteKitAuth({ + providers: [ + Zitadel({ + issuer: "https://auth.pvm.example.com", + clientId: env.ZITADEL_CLIENT_ID, + clientSecret: env.ZITADEL_CLIENT_SECRET, + authorization: { + params: { + // offline_access asks the provider for a refresh token + scope: "openid profile email offline_access", + }, + }, + }), + ], + callbacks: { + async jwt({ token, account }) { + if (account) { + token.accessToken = account.access_token; + token.refreshToken = account.refresh_token; + token.expiresAt = account.expires_at; + } + return token; + }, + async session({ session, token }) { + session.accessToken = token.accessToken; + return session; + }, + }, +}); +``` + +**Route protection:** + +```typescript +// src/routes/venue/+page.server.ts +import { redirect } from "@sveltejs/kit"; +import type { PageServerLoad } from "./$types"; + +export const load: PageServerLoad = async (event) => { + const session = await event.locals.auth(); + if (!session) { + throw redirect(303, "/auth/signin"); + } + return { session }; +}; +``` + +**Dual API client (cloud vs. 
local):** + +```typescript +// src/lib/api-client.ts +import { browser } from "$app/environment"; + +class PvmApiClient { + private cloudUrl: string; + private localUrl: string | null = null; + + constructor(cloudUrl: string) { + this.cloudUrl = cloudUrl; + } + + // Set when mDNS discovers a local node + setLocalNode(url: string) { + this.localUrl = url; + } + + async fetch(path: string, token: string, options?: RequestInit) { + // Try local first (lower latency), fall back to cloud + if (this.localUrl) { + try { + const res = await fetch(`${this.localUrl}${path}`, { + ...options, + headers: { Authorization: `Bearer ${token}`, ...options?.headers }, + signal: AbortSignal.timeout(2000), // 2s timeout for local + }); + if (res.ok) return res; + } catch { + // Local node unreachable, fall through to cloud + } + } + + return fetch(`${this.cloudUrl}${path}`, { + ...options, + headers: { Authorization: `Bearer ${token}`, ...options?.headers }, + }); + } +} +``` + +### 6.3 Zitadel Deployment on Hetzner PVE + +**Docker Compose (production):** + +```yaml +version: "3.8" +services: + zitadel: + image: ghcr.io/zitadel/zitadel:v3-latest + command: start-from-init --masterkey "${ZITADEL_MASTERKEY}" + environment: + ZITADEL_DATABASE_POSTGRES_HOST: postgres + ZITADEL_DATABASE_POSTGRES_PORT: 5432 + ZITADEL_DATABASE_POSTGRES_DATABASE: zitadel + ZITADEL_DATABASE_POSTGRES_USER_USERNAME: zitadel + ZITADEL_DATABASE_POSTGRES_USER_PASSWORD: "${ZITADEL_DB_PASSWORD}" + ZITADEL_EXTERNALDOMAIN: auth.pvm.example.com + ZITADEL_EXTERNALPORT: 443 + ZITADEL_EXTERNALSECURE: "true" + ZITADEL_TLS_MODE: external # TLS terminated at reverse proxy + ports: + - "8080:8080" + depends_on: + postgres: + condition: service_healthy + deploy: + resources: + limits: + memory: 2G + cpus: "4" + reservations: + memory: 512M + cpus: "1" + + postgres: + image: postgres:16-alpine + environment: + POSTGRES_DB: zitadel + POSTGRES_USER: zitadel + POSTGRES_PASSWORD: "${ZITADEL_DB_PASSWORD}" + volumes: + - 
zitadel-pg-data:/var/lib/postgresql/data + healthcheck: + test: ["CMD-SHELL", "pg_isready -U zitadel"] + interval: 5s + timeout: 5s + retries: 5 + deploy: + resources: + limits: + memory: 1G + cpus: "2" + +volumes: + zitadel-pg-data: +``` + +**Estimated resource usage on Hetzner PVE:** +- Zitadel: 512MB-2GB RAM, 1-4 CPU cores +- PostgreSQL (shared or dedicated): 256MB-1GB RAM +- **Total: ~1-3GB RAM** for the auth stack + +--- + +## 7. Security Considerations + +### 7.1 Token Security + +| Concern | Mitigation | +|---|---| +| Token theft | Short-lived access tokens (15min). Refresh tokens are opaque and stored server-side by Zitadel. | +| Token replay | Include `iat` (issued-at) and `jti` (JWT ID) claims. Local nodes can maintain a small replay cache. | +| Key compromise (cloud) | Zitadel supports key rotation. JWKS cache on RPi5 auto-updates. Revoke compromised keys immediately. | +| Key compromise (node) | Each node has its own key pair. Revoke a single node's key without affecting others. | +| Offline token abuse | "Bridge tokens" issued by local nodes are short-lived (30min) and carry reduced permissions. | +| JWKS cache staleness | 72h maximum offline window. Keys should have longer lifetimes than this. Include previous key versions in cache. | + +### 7.2 Social Login Security + +- All social OAuth flows terminate at the cloud (Zitadel handles the redirect dance) +- Zitadel validates social provider tokens and issues PVM JWTs +- The local node never sees social provider tokens -- only PVM JWTs +- PKCE is used for all authorization code flows (prevents code interception) + +### 7.3 MFA Considerations + +- TOTP enrollment happens via Zitadel (cloud) +- TOTP verification can work offline IF the local node has the user's TOTP secret (synced via NATS) + - **Recommendation:** For simplicity, require MFA only for sensitive operations routed to the cloud. Venue check-in at a local node uses standard JWT validation without MFA step-up. 
+- Passkeys/FIDO2 require the authenticator device, which is local to the user's phone -- works offline + +### 7.4 AGPL License Risk Assessment + +| Scenario | Risk | +|---|---| +| Using Zitadel as-is (our case) | No risk. AGPL allows use as a service without source disclosure. | +| Modifying Zitadel source code | Must share modifications under AGPL. Avoid this -- use Zitadel's extension points instead. | +| Linking Zitadel libraries in PVM code | The SDKs are Apache 2.0, so no issue. | +| Distributing Zitadel binary | Must provide source. Not our case -- we self-host only. | + +### 7.5 Threat Model for Split-Brain Auth + +**Threat:** Attacker compromises an RPi5 node and extracts the JWKS cache. +**Impact:** Low. JWKS contains only public keys. Attacker cannot forge tokens. + +**Threat:** Attacker compromises an RPi5 node and extracts the node's private key. +**Impact:** Medium. Attacker can forge "bridge tokens" for that node. Mitigation: revoke the node's key via cloud, notify affected users. + +**Threat:** Attacker presents a valid cloud JWT to a local node after the user's account is disabled in the cloud. +**Impact:** Medium. The local node cannot check account status while offline. Mitigation: short token lifetimes (15min), and process account revocations on next NATS sync. + +**Threat:** Replay attack with expired token during offline period. +**Impact:** Low. JWT `exp` claim is always checked. Expired tokens are rejected regardless of network state. + +--- + +## 8. Alternatives Considered (Detail) + +### 8.1 Auth0 + +- **Free tier:** 25,000 MAU (B2C), 500 MAU (B2B) +- **Pros:** Excellent documentation, many SDKs, built-in social login +- **Cons:** Cloud-only (no self-hosting), no MFA on free tier, expensive at scale ($240/mo for pro), vendor lock-in risks for split-brain architecture. The PVM local node would depend on cached JWKS from Auth0's cloud endpoint -- any Auth0 outage affects token validation. 
+- **Verdict:** Vendor dependency is unacceptable for an offline-first architecture. + +### 8.2 Clerk + +- **Free tier:** 10,000 MAU, 100 organizations +- **Pros:** Great DX, community Rust SDK (`clerk-rs`), community SvelteKit SDK (`svelte-clerk`) +- **Cons:** Cloud-only, session-based (not pure JWT), the Rust SDK is community-maintained with uncertain longevity. No self-hosting option means complete vendor dependency. +- **Verdict:** Cloud-only with session-based auth is fundamentally incompatible with offline local nodes. + +### 8.3 Supabase Auth (GoTrue) + +- **Free tier:** 50,000 MAU (cloud), unlimited self-hosted +- **Pros:** Simple JWT-based auth, supports RS256, lightweight GoTrue binary +- **Cons:** No Rust SDK, primarily designed as part of the Supabase ecosystem. Self-hosting GoTrue independently requires running it separate from Supabase, which is poorly documented. Limited social provider configuration. No admin UI when self-hosted standalone. +- **Verdict:** Too tightly coupled to the Supabase ecosystem. Could work as a lightweight option but lacks the identity management features PVM needs. + +### 8.4 Logto + +- **Free tier:** 50,000 MAU (cloud), unlimited self-hosted +- **Pros:** Modern UI, good documentation, OIDC/OAuth 2.1 compliant, RBAC built-in, 20+ social providers +- **Cons:** No Rust SDK (would need to use generic OIDC/JWT crates), Node.js-based (heavier than Go alternatives), relatively young project. SvelteKit support via generic OIDC. +- **Verdict:** Strong contender but loses to Zitadel on Rust integration. If Zitadel didn't have its Rust crate, Logto would be the top pick. 
+ +### 8.5 SuperTokens + +- **Free tier:** Unlimited self-hosted (open source features), 5,000 MAU cloud +- **Pros:** Self-hosted is fully free, good documentation, session management with anti-CSRF +- **Cons:** No Rust SDK (Node.js, Python, Go only), session-based rather than JWT-focused (would need to run a SuperTokens sidecar), requires SuperTokens core Java service alongside your backend. +- **Verdict:** Session-based model doesn't fit split-brain offline validation. Running a Java core service adds unwanted complexity. + +### 8.6 Hanko + +- **Free tier:** 10,000 MAU (cloud), unlimited self-hosted (AGPL) +- **Pros:** Passkey-first (great future-proofing), lightweight Go binary, simple API, web components for frontend +- **Cons:** No Rust SDK, limited social login providers compared to Zitadel, smaller community, AGPL license (same as Zitadel). Passkey-first approach may frustrate users who prefer passwords. +- **Verdict:** Interesting for passkey-first apps but too narrow for PVM's diverse auth needs (social login, email+password, phone+password). + +### 8.7 Authentik + +- **Free tier:** Unlimited self-hosted (open source) +- **Pros:** Full-featured IdP, great admin UI, OIDC/OAuth2/SAML/LDAP/RADIUS support, active development +- **Cons:** No Rust SDK, Python/Django-based (2GB+ RAM minimum), heavier than Go-based alternatives. Designed primarily as a reverse-proxy auth provider for self-hosted services (Plex, Grafana, etc.), not as an embeddable auth API. +- **Verdict:** Excellent for homelab SSO but over-resourced and architecturally mismatched for PVM's API-first needs. 
+ +### 8.8 Building Custom Auth in Rust + +**Available crates:** +- `jsonwebtoken` (v9) -- JWT creation and validation, RS256/ES256/EdDSA +- `oauth2` -- OAuth2 client flows +- `totp-rs` -- TOTP generation and validation +- `argon2` / `password-auth` -- Password hashing (Argon2id, OWASP recommended params) +- `axum-jwt-auth` -- Axum middleware for JWT with JWKS +- `openidconnect` -- Full OIDC client library + +**Estimated effort:** 4-8 weeks for a full auth system with social login, password auth, MFA, session management, account recovery, email verification, and admin UI. + +**Risks:** +- Auth is a security-critical system; bugs lead to breaches +- Ongoing maintenance burden (security patches, protocol updates) +- Social login requires implementing OAuth2 flows for each provider +- Account recovery, email verification, and brute force protection all need custom implementation +- No admin UI out of the box + +**Verdict:** The Rust ecosystem has excellent building blocks, but assembling them into a production auth system is a multi-month effort that Zitadel provides out of the box. The split-brain JWT validation part IS worth building custom (it's just `jsonwebtoken` + a JWKS cache), but the full identity management should be delegated to Zitadel. + +--- + +## 9. Final Recommendation & Next Steps + +### Architecture Decision + +| Component | Solution | +|---|---| +| Identity Provider | Zitadel v3 (self-hosted on Hetzner PVE) | +| Cloud API auth | Zitadel Rust crate (`zitadel::axum`) for introspection OR standalone JWKS validation | +| Local node auth | Custom JWT validation using `jsonwebtoken` crate + cached JWKS | +| Frontend auth | `@auth/sveltekit` with Zitadel OIDC provider | +| JWKS sync | NATS JetStream + periodic HTTP fetch | +| Token format | RS256 JWTs (access), opaque refresh tokens | +| Database | PostgreSQL 16 (shared with or separate from PVM's main database) | + +### Implementation Order + +1. 
**Week 1:** Deploy Zitadel on Hetzner PVE (Docker Compose + PostgreSQL). Configure social providers (Google, Apple, Facebook). Set up email+password and phone auth. +2. **Week 2:** Integrate SvelteKit frontend with Zitadel using `@auth/sveltekit`. Build login/signup flows. Test PKCE authorization code flow. +3. **Week 3:** Integrate Cloud Rust API with Zitadel. Use `zitadel::axum` for token validation. Implement user context extraction from JWT claims. +4. **Week 4:** Build JWKS caching on RPi5. Implement offline JWT validation with `jsonwebtoken`. Set up NATS-based JWKS sync. Test offline scenarios. +5. **Week 5:** Implement "bridge token" issuance on RPi5 for offline token refresh. Register node public keys with cloud. Test full split-brain auth flow. +6. **Week 6:** Enable MFA (TOTP). Configure branding and custom login pages. Security review and penetration testing. + +### Cost Estimate + +| Item | Cost | +|---|---| +| Zitadel (self-hosted) | $0 | +| PostgreSQL (already in stack) | $0 | +| Hetzner resources (incremental) | ~5-10 EUR/month for 2GB RAM + 2 CPU LXC | +| Social login API keys | $0 (Google, Apple, Facebook all free) | +| **Total** | **~5-10 EUR/month** | + +--- + +## Appendix: Sources + +- [Zitadel GitHub](https://github.com/zitadel/zitadel) +- [Zitadel Pricing](https://zitadel.com/pricing/detail) +- [Zitadel Rust Crate (docs.rs)](https://docs.rs/zitadel/latest/zitadel/) +- [Zitadel SvelteKit Example](https://github.com/zitadel/example-auth-sveltekit) +- [Zitadel Self-Hosting Specs](https://help.zitadel.com/what-are-zitadel-minimum-self-hosted-specs) +- [Zitadel v3 Announcement (AGPL, PostgreSQL)](https://zitadel.com/blog/zitadel-v3-announcement) +- [Zitadel License FAQ](https://zitadel.com/license-faq) +- [Ory Kratos GitHub](https://github.com/ory/kratos) +- [Ory Hydra GitHub](https://github.com/ory/hydra) +- [Ory Kratos Rust SDK](https://github.com/ory/kratos-client-rust) +- [Keycloak 26.5 
Release](https://www.keycloak.org/2026/01/keycloak-2650-released) +- [Keycloak Memory Sizing](https://www.keycloak.org/high-availability/concepts-memory-and-cpu-sizing) +- [Auth0 Pricing](https://auth0.com/pricing) +- [Clerk Pricing](https://clerk.com/pricing) +- [Clerk Rust SDK](https://github.com/DarrenBaldwin07/clerk-rs) +- [Logto Pricing](https://logto.io/pricing) +- [Logto GitHub](https://github.com/logto-io/logto) +- [SuperTokens Pricing](https://supertokens.com/pricing) +- [Hanko GitHub](https://github.com/teamhanko/hanko) +- [Hanko Pricing](https://www.hanko.io/pricing) +- [Authentik GitHub](https://github.com/goauthentik/authentik) +- [Authentik Pricing](https://goauthentik.io/pricing/) +- [Supabase Auth GitHub](https://github.com/supabase/auth) +- [Supabase Pricing](https://supabase.com/pricing) +- [jsonwebtoken Rust Crate](https://github.com/Keats/jsonwebtoken) +- [axum-jwt-auth Crate](https://crates.io/crates/axum-jwt-auth) +- [Edge JWT Validation Patterns](https://securityboulevard.com/2025/11/how-to-validate-jwts-efficiently-at-the-edge-with-cloudflare-workers-and-vercel/)