Add auth framework research document

Comprehensive evaluation of 11 auth frameworks for PVM's
split-brain architecture. Recommends self-hosted Zitadel v3
for its Rust crate, OIDC JWKS for offline JWT validation on
RPi5 nodes, and zero-cost self-hosting on existing Hetzner PVE.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mikkel Georgsen 2026-02-08 03:24:51 +01:00
parent 995a8123e6
commit e25afdcb3a

docs/AUTH_RESEARCH.md Normal file

@ -0,0 +1,805 @@
# PVM Authentication Framework Research
> **Date:** 2026-02-08
> **Status:** Final
> **Author:** Research Agent (Claude)
---
## 1. Executive Summary
**Recommendation: Zitadel (self-hosted) + lightweight JWT validation on local nodes.**
After evaluating 11 authentication frameworks against PVM's unique split-brain architecture requirements, Zitadel emerges as the clear winner for these reasons:
1. **Official Rust/Axum crate** (`zitadel` on crates.io) with dedicated Axum middleware, introspection, and OIDC modules -- none of the other evaluated platforms has this level of first-class Rust support.
2. **Official SvelteKit integration** via Auth.js with documented PKCE flow, maintained by the Zitadel team.
3. **Self-hosted on PostgreSQL** (v3+ requires PostgreSQL, dropping CockroachDB) -- PVM already uses PostgreSQL 16+, so Zitadel shares the same database engine with zero additional database infrastructure.
4. **Standard OIDC/OAuth2 with JWKS endpoint** -- the RPi5 local nodes cache the JWKS public keys and validate JWTs entirely offline. No auth server needed on the Pi.
5. **AGPL v3 license** -- fine for PVM since we use Zitadel as-is (not modifying its source code), and it runs as an independent service.
6. **Resource-efficient** -- runs on 512MB RAM + 1 CPU for test environments, 1-2GB RAM + 2-4 CPUs for production. Fits comfortably on Hetzner PVE.
7. **Full feature coverage** -- social login (Google, Apple, Facebook), email+password, phone+password, TOTP/MFA, passkeys, magic links, RBAC, admin console, audit logs.
8. **Free forever when self-hosted** -- no MAU limits, no feature gating on the self-hosted version.
**Runner-up: Ory (Kratos + Hydra)** -- more flexible but significantly more complex to operate (two services, custom UI required, manual integration between components).
**Third place: Keycloak** -- battle-tested but Java-based, heavy on resources (about 1.25GB RAM base including caches), and without a Rust SDK.
---
## 2. The Split-Brain Auth Challenge
### The Problem
PVM has a distributed architecture where a player's phone can talk to either:
- **The cloud** (Hetzner PVE) -- the primary SaaS backend
- **A local RPi5 node** at a poker venue -- for low latency and offline resilience
The local node may be offline for up to 72 hours. When online, it syncs via NATS JetStream. This creates a fundamental auth challenge:
```
Player Phone
|
|-- (mDNS discovery) --> RPi5 Local Node (may be offline)
|
|-- (internet) -------> Cloud SaaS (Hetzner PVE)
```
**Auth tokens issued by the cloud must be valid on the local node, and vice versa**, without the local node calling home to verify them.
### The Solution Pattern
The only viable approach for offline token validation is **asymmetric JWT signing with cached JWKS**:
1. **Zitadel runs on the cloud** (Hetzner PVE), issuing JWTs signed with RS256 (RSA) or ES256 (ECDSA) private keys.
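2. **The JWKS (public keys) are published** at the `jwks_uri` advertised in the OIDC discovery document (`/.well-known/openid-configuration`). OIDC does not fix this path; Zitadel serves its keys at `/oauth/v2/keys`, so resolve the URL from discovery rather than hard-coding `/.well-known/jwks.json`.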
3. **Each RPi5 node caches the JWKS** when it syncs with the cloud. The cache is refreshed on every NATS sync cycle.
4. **When offline, the RPi5 validates JWTs** using only the cached public keys -- pure cryptographic verification, no network calls.
5. **Token refresh** happens against whichever endpoint is reachable (cloud or local). The local node can issue short-lived tokens that are also verifiable by the cloud (using the same or a federated key trust).
### Key Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Signing algorithm | RS256 or ES256 | Asymmetric -- public key can be distributed freely |
| Token format | JWT (access token) + opaque refresh token | JWTs are self-contained and verifiable offline |
| JWKS caching | On RPi5 via NATS sync | Keeps offline validation working for the full 72h offline window |
| Token lifetime | Access: 15min, Refresh: 7 days | Short access tokens limit blast radius; refresh tokens cover offline periods |
| Auth server location | Cloud only | RPi5 validates JWTs; the only tokens it issues are short-lived bridge tokens during offline refresh |
| Social login | Cloud only (OAuth requires internet) | Cloud issues PVM JWT after social auth completes |
---
## 3. Evaluation Matrix
### Scoring Key
- **A** = Excellent fit
- **B** = Good fit with minor gaps
- **C** = Usable but significant caveats
- **D** = Poor fit
- **F** = Does not work
| Framework | Rust SDK | SvelteKit | Self-Hosted | Free Tier | Social Login | MFA/2FA | JWT/JWKS | Resource Needs | Split-Brain Fit | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| **Zitadel** | A (official crate) | A (Auth.js example) | A (Docker+PG) | A (unlimited self-hosted) | A (Google/Apple/FB+) | A (TOTP, passkeys) | A (standard OIDC JWKS) | B (512MB-2GB) | A | **A** |
| **Ory (Kratos+Hydra)** | B (auto-gen SDK) | B (community kit) | A (Go binaries) | A (fully OSS) | A (via Kratos) | A (TOTP, WebAuthn) | A (Hydra JWKS) | A (lightweight Go) | A | **B+** |
| **Keycloak** | D (no SDK, REST API) | B (OIDC generic) | A (Docker) | A (fully OSS) | A (built-in) | A (TOTP, WebAuthn) | A (JWKS) | C (1.25GB+ RAM, Java) | A | **B** |
| **Logto** | D (no Rust SDK) | B (OIDC generic) | A (Docker, Node.js) | A (unlimited self-hosted) | A (20+ providers) | A (TOTP, passkeys) | A (OIDC JWKS) | B (~512MB-1GB) | A | **B** |
| **Authentik** | D (no SDK, REST/OIDC) | B (OIDC generic) | A (Docker) | A (fully OSS) | A (broad) | A (TOTP, WebAuthn) | A (OIDC JWKS) | C (2GB+ RAM, Python) | A | **B-** |
| **Auth0** | D (no SDK) | B (Auth.js) | F (cloud only) | B (25k MAU free) | A (built-in) | C (paid only) | A (JWKS) | N/A (managed) | B (vendor dep.) | **C+** |
| **Clerk** | C (community crate) | B (community svelte-clerk) | F (cloud only) | B (10k MAU free) | A (built-in) | A (built-in) | B (session tokens) | N/A (managed) | C (cloud-dependent) | **C** |
| **Supabase Auth** | D (no Rust SDK) | C (JS client) | B (GoTrue Docker) | B (50k MAU cloud) | A (built-in) | B (limited) | B (RS256 JWTs) | B (~512MB) | B (GoTrue only) | **C** |
| **SuperTokens** | D (no Rust SDK) | C (React SDK focus) | A (Docker core) | A (unlimited self-hosted) | A (built-in) | A (TOTP) | C (session-based, not JWT) | B (~1GB) | C (session model) | **C** |
| **Hanko** | D (no Rust SDK) | B (web components) | A (Docker, Go) | B (10k MAU cloud) | B (limited providers) | A (passkeys native) | B (OIDC) | A (lightweight Go) | B | **C+** |
| **Custom (Rust)** | A (own code) | A (own design) | A (embedded) | A (free) | C (build OAuth flows) | C (build TOTP) | A (jsonwebtoken crate) | A (no overhead) | A | **B-** |
---
## 4. Deep Dive: Top 3 Candidates
### 4.1 Zitadel (Recommended)
**What it is:** A cloud-native identity management platform written in Go, providing full OIDC/OAuth2, SAML, and LDAP support with a built-in admin console.
**Why it wins for PVM:**
**Rust Integration (Best-in-Class)**
The `zitadel` crate (v5.5+) provides:
- `zitadel::axum` module with middleware for token introspection
- `zitadel::oidc` for OpenID Connect discovery and token validation
- `introspection_cache` feature flag for caching OIDC discovery and introspection results
- Feature flags: `axum`, `oidc`, `credentials`, `api`, `introspection_cache`
```toml
# Cargo.toml
[dependencies]
zitadel = { version = "5", features = ["axum", "oidc", "introspection_cache"] }
```
**SvelteKit Integration (Official)**
Zitadel maintains an [official example](https://github.com/zitadel/example-auth-sveltekit) using `@auth/sveltekit` with:
- PKCE authorization code flow
- Automatic token refresh
- Server-side session management via SvelteKit load functions
- Federated logout with CSRF protection
**Self-Hosting (Simple)**
- Single Go binary + PostgreSQL (PVM already has PG 16+)
- Docker Compose deployment in minutes
- v3+ requires PostgreSQL only (dropped CockroachDB)
- Resource needs: 512MB RAM (test), 1-2GB RAM + 2-4 CPUs (production)
**Feature Completeness:**
- Social login: Google, Apple, Facebook, GitHub, GitLab, Microsoft, and more
- Email + password with customizable policies
- Phone number authentication
- TOTP, passkeys/FIDO2, email/SMS OTP
- Magic links / passwordless
- Built-in admin console (web UI)
- Multi-tenancy with organizations
- RBAC with roles and permissions
- Unlimited audit trail
- Branding and custom login pages
- Account linking across providers
**Licensing:**
- AGPL v3 as of Zitadel v3 (March 2025)
- Using Zitadel as an identity service without modifying its source code is fine for commercial use
- SDKs and Protocol Buffer definitions remain Apache 2.0
- A commercial license is available if AGPL is incompatible
**Limitations:**
- AGPL may concern some organizations (but not PVM's use case)
- The Rust crate's introspection module requires network access to Zitadel for token introspection (but we use JWKS validation instead on the RPi5, which is offline-capable)
- Resource usage spikes during password hashing (4 CPU cores recommended for production)
---
### 4.2 Ory (Kratos + Hydra) -- Runner-Up
**What it is:** A suite of Go microservices -- Kratos for identity management, Hydra for OAuth2/OIDC, Keto for permissions, Oathkeeper for API gateway auth.
**Why it's strong:**
- Written in Go, lightweight binaries (5-15MB each), low resource usage
- Kratos handles registration, login, MFA, social login, account recovery
- Hydra is OpenID Certified and handles OAuth2 + JWKS endpoint
- Auto-generated Rust SDK for both Kratos and Hydra APIs
- Fully open source (Apache 2.0 license)
- Scales to billions of users and is used by OpenAI, according to Ory
**Why it loses to Zitadel for PVM:**
- **Operational complexity:** You need to run Kratos AND Hydra as separate services, configure them to work together, and build a custom login/consent UI. This is significant engineering overhead.
- **No built-in admin UI:** You must build or find a third-party admin interface.
- **SvelteKit integration:** Only community examples exist (ory-kit by MarkusThielker), and development on the SvelteKit UI has stopped.
- **Rust SDK is auto-generated:** Works but lacks the ergonomics and Axum-specific middleware of Zitadel's crate.
- **Documentation complexity:** Setting up Kratos + Hydra together requires deep understanding of OAuth2 flows and significant configuration.
**Resource requirements:** Very lightweight. Kratos idles at ~15-380MB depending on configuration. Hydra is similarly lean. Total for both: 256MB-1GB RAM.
**Best for:** Teams that want maximum flexibility and are willing to invest in custom UI development and operational complexity.
---
### 4.3 Keycloak -- Third Place
**What it is:** The industry-standard open-source identity management platform, backed by Red Hat/JBoss, written in Java.
**Why it's considered:**
- Most battle-tested solution in the market (used by thousands of enterprises)
- Full OIDC/OAuth2/SAML support with standard JWKS endpoints
- Built-in admin console, user management, social login, MFA
- Extensive documentation and community
- FAPI 2.0 compliant (Keycloak 26.4+)
- JWT Authorization Grant (RFC 7523) in Keycloak 26.5
**Why it loses for PVM:**
- **Java-based, resource-heavy:** Minimum 750MB RAM for a bare container, recommended 2GB for production. PVM's Hetzner PVE resources are better spent elsewhere.
- **No Rust SDK:** You'd use generic OIDC/JWT validation crates. The REST admin API works but has no Rust client.
- **Slower startup:** Java cold starts are measured in seconds, not milliseconds.
- **Overkill for PVM:** Enterprise features like SAML, LDAP federation, and Kerberos add complexity without value for a poker venue SaaS.
- **Theme customization:** Uses FreeMarker templates, which have a steep learning curve.
**Resource requirements:** 1.25GB RAM base (including caches), recommended 2GB+ for production. 1-2 CPU cores minimum.
**Best for:** Enterprises with existing Java infrastructure that need SAML/LDAP federation.
---
## 5. Recommended Architecture
### Overview
```
CLOUD (Hetzner PVE)
┌──────────────────────────────────────────┐
│ │
│ ┌─────────┐ ┌──────────────────┐ │
│ │ Zitadel │ │ PVM Cloud API │ │
│ │ (Auth) │◄───►│ (Rust/Axum) │ │
│ │ │ │ │ │
│ │ PG DB │ │ PG DB │ │
│ └────┬────┘ └────────┬─────────┘ │
│ │ │ │
│ │ JWKS endpoint │ NATS │
│ │ /.well-known/ │ JetStream │
│ │ jwks.json │ │
└───────┼───────────────────┼──────────────┘
│ │
│ │
════════╪═══════════════════╪═══════ INTERNET
│ │
▼ ▼
┌──────────────────────────────────────────┐
│ RPi5 LOCAL NODE │
│ │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ Cached JWKS │ │ PVM Local API │ │
│ │ (public keys)│◄─│ (Rust binary) │ │
│ │ │ │ │ │
│ │ Updated via │ │ libSQL DB │ │
│ │ NATS sync │ │ │ │
│ └──────────────┘ │ NATS leaf node │ │
│ └─────────────────┘ │
└──────────────────────────────────────────┘
│ mDNS discovery + local API calls
┌───────┴──────┐
│ Player Phone │
│ (SvelteKit) │
└──────────────┘
```
### Auth Flow: Registration & Login (Cloud)
```
1. Player opens PVM app (SvelteKit)
2. App detects network connectivity --> routes to cloud
3. Player chooses: email+password, phone+password, or social login (Google/Apple/Facebook)
4. SvelteKit redirects to Zitadel login page (OIDC Authorization Code + PKCE)
5. Zitadel handles the auth flow (including social OAuth if applicable)
6. Zitadel issues:
- Access token (JWT, signed RS256, 15min expiry)
- Refresh token (opaque, 7-day expiry)
- ID token (JWT with user claims)
7. SvelteKit stores tokens (httpOnly cookies for SSR, secure storage for SPA)
8. Cloud API validates JWT on each request using Zitadel's JWKS
```
### Auth Flow: Local Node (Offline-Capable)
```
1. Player phone discovers RPi5 via mDNS
2. Phone sends request to local API with existing JWT (from cloud login)
3. RPi5 Rust binary validates JWT:
a. Parse JWT header to get key ID (kid)
b. Look up public key in cached JWKS (stored in libSQL or memory)
c. Verify RS256 signature
d. Validate claims (exp, iss, aud, sub)
4. If JWT is expired but refresh token is available:
a. If cloud is reachable: refresh against Zitadel
b. If offline: issue a short-lived local token (signed with the node's key)
- The cloud trusts the node's public key (registered during provisioning)
5. Request is authenticated; proceed with venue operations
```
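Steps 3b and 3d of the flow above can be sketched with the standard library alone. This is a minimal sketch, not the node's actual implementation: `Validator`, `Claims`, `AuthError`, the string-typed key, and the `leeway_secs` field are hypothetical names, and the signature check (step 3c) is assumed to have been done by `jsonwebtoken` before `check_claims` runs.

```rust
use std::collections::HashMap;

/// Claims the local node checks once the signature has been verified.
#[derive(Debug, Clone)]
pub struct Claims {
    pub sub: String,
    pub iss: String,
    pub aud: Vec<String>,
    pub exp: u64, // seconds since the Unix epoch
}

#[derive(Debug, PartialEq)]
pub enum AuthError {
    UnknownKey,
    Expired,
    WrongIssuer,
    WrongAudience,
    MissingSubject,
}

/// Hypothetical stand-in for a cached public key; a real node would hold
/// `jsonwebtoken::DecodingKey` values instead.
pub type PublicKeyPem = String;

pub struct Validator {
    pub keys: HashMap<String, PublicKeyPem>, // kid -> public key
    pub issuer: String,
    pub audience: String,
    pub leeway_secs: u64, // tolerated clock skew (assumption)
}

impl Validator {
    /// Step 3b: look up the public key for the token's `kid`.
    pub fn key_for(&self, kid: &str) -> Result<&PublicKeyPem, AuthError> {
        self.keys.get(kid).ok_or(AuthError::UnknownKey)
    }

    /// Step 3d: validate exp, iss, aud, and sub against the node's expectations.
    pub fn check_claims(&self, claims: &Claims, now: u64) -> Result<(), AuthError> {
        if now >= claims.exp.saturating_add(self.leeway_secs) {
            return Err(AuthError::Expired);
        }
        if claims.iss != self.issuer {
            return Err(AuthError::WrongIssuer);
        }
        if !claims.aud.iter().any(|a| a == &self.audience) {
            return Err(AuthError::WrongAudience);
        }
        if claims.sub.is_empty() {
            return Err(AuthError::MissingSubject);
        }
        Ok(())
    }
}
```

Passing `now` as a parameter keeps the check deterministic in tests and lets the node substitute its own clock source.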
### Auth Flow: Token Refresh Strategy
```
Token Refresh Decision Tree:
├── Cloud reachable?
│ ├── YES: Refresh against Zitadel (standard OIDC refresh)
│ │ └── New access token (15min) + new refresh token (7 days)
│ │
│ └── NO: Is the refresh token still valid (< 7 days)?
│ ├── YES: Local node issues a "bridge token"
│ │ - Signed with node's key pair
│ │ - Short-lived (30 min)
│ │ - Contains original user claims from the expired JWT
│ │ - Marked with a "local_issued" claim
│ │
│ └── NO: User must re-authenticate when cloud is reachable
│ (graceful degradation -- show "offline mode limited")
```
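The decision tree above maps directly onto a small pure function. The sketch below is illustrative only: `RefreshAction`, `decide_refresh`, and `bridge_tokens_needed` are hypothetical names, and it assumes the node can track how old the session's refresh token is (for example from the time of the last cloud login), since an opaque refresh token cannot be inspected offline.

```rust
use std::time::Duration;

/// Lifetimes taken from the decision tree: 7-day refresh window, 30-minute bridge tokens.
pub const REFRESH_LIFETIME: Duration = Duration::from_secs(7 * 24 * 60 * 60);
pub const BRIDGE_LIFETIME: Duration = Duration::from_secs(30 * 60);

#[derive(Debug, PartialEq)]
pub enum RefreshAction {
    /// Standard OIDC refresh against Zitadel.
    RefreshAtZitadel,
    /// The node signs a short-lived bridge token itself.
    IssueBridgeToken { lifetime: Duration },
    /// Nothing can be refreshed; the user logs in again once the cloud is reachable.
    RequireReauth,
}

/// Walks the tree: cloud first, then the refresh window, then re-authentication.
pub fn decide_refresh(cloud_reachable: bool, refresh_token_age: Duration) -> RefreshAction {
    if cloud_reachable {
        return RefreshAction::RefreshAtZitadel;
    }
    if refresh_token_age < REFRESH_LIFETIME {
        RefreshAction::IssueBridgeToken { lifetime: BRIDGE_LIFETIME }
    } else {
        RefreshAction::RequireReauth
    }
}

/// Number of bridge tokens a continuously active session needs during an outage
/// (rounded up).
pub fn bridge_tokens_needed(offline: Duration) -> u64 {
    let bridge = BRIDGE_LIFETIME.as_secs();
    (offline.as_secs() + bridge - 1) / bridge
}
```

For the full 72-hour offline window, `bridge_tokens_needed` gives 72 h / 30 min = 144 bridge tokens per continuously active session, which is a useful upper bound when sizing logs or rate limits on the node.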
### JWKS Sync Strategy
```
1. On RPi5 boot / NATS reconnect:
- Fetch JWKS from the jwks_uri in Zitadel's OIDC discovery document (/oauth/v2/keys)
- Store in libSQL (jwks table) with timestamp
- Also cache in memory (HashMap<kid, DecodingKey>)
2. Periodic refresh (every 1 hour while connected):
- Re-fetch JWKS
- Compare with cached version
- Update if changed (key rotation support)
3. Via NATS JetStream:
- Cloud publishes "jwks.updated" event on key rotation
- RPi5 subscribes and refreshes immediately
4. Offline fallback:
- Use last cached JWKS (stored in libSQL)
- Valid for up to 72 hours (matches offline window)
- Include 2-3 previous key versions to handle rotation during offline period
```
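The rotation and offline-fallback rules above can be sketched as a small history of JWKS snapshots. This is a sketch under assumptions: `JwksSnapshot`, `JwksHistory`, and the choice to track only key IDs are hypothetical simplifications (a real cache would store the key material as well), and `KEPT_VERSIONS` is set to 3 to match the "2-3 previous key versions" guidance.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Offline window from the sync strategy: the cached JWKS is trusted for up to 72 hours.
pub const MAX_OFFLINE: Duration = Duration::from_secs(72 * 60 * 60);
/// How many distinct key sets to retain so tokens signed before a rotation still verify.
pub const KEPT_VERSIONS: usize = 3;

#[derive(Debug, Clone, PartialEq)]
pub struct JwksSnapshot {
    pub fetched_at: u64,   // seconds since the Unix epoch
    pub kids: Vec<String>, // key IDs in this JWKS (key material omitted in this sketch)
}

/// Newest snapshot first.
#[derive(Default)]
pub struct JwksHistory {
    snapshots: VecDeque<JwksSnapshot>,
}

impl JwksHistory {
    pub fn new() -> Self {
        Self::default()
    }

    /// Records a fetched JWKS. Returns true when the key set changed (a rotation);
    /// an unchanged set only refreshes the timestamp of the newest snapshot.
    pub fn record(&mut self, snapshot: JwksSnapshot) -> bool {
        let changed = self
            .snapshots
            .front()
            .map_or(true, |newest| newest.kids != snapshot.kids);
        if changed {
            self.snapshots.push_front(snapshot);
            self.snapshots.truncate(KEPT_VERSIONS);
        } else if let Some(newest) = self.snapshots.front_mut() {
            newest.fetched_at = snapshot.fetched_at;
        }
        changed
    }

    /// True if any retained snapshot contains this key ID.
    pub fn knows_kid(&self, kid: &str) -> bool {
        self.snapshots
            .iter()
            .any(|s| s.kids.iter().any(|k| k == kid))
    }

    /// True while the newest snapshot is within the 72-hour offline window.
    pub fn is_fresh(&self, now: u64) -> bool {
        self.snapshots
            .front()
            .map_or(false, |s| now.saturating_sub(s.fetched_at) <= MAX_OFFLINE.as_secs())
    }
}
```

Keeping a bounded history instead of a single snapshot means a token signed just before a rotation still finds its `kid`, while the oldest key set ages out automatically.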
### Node Key Trust Model
```
1. RPi5 provisioning:
- Node generates its own RSA key pair (used for RS256 signing) on first boot
- Public key is registered with the cloud PVM API via NATS
- Cloud stores node public keys in its database
2. Local token issuance (offline refresh):
- Node signs "bridge tokens" with its private key
- Token includes: original user sub, node_id, "local_issued" flag
- When the node reconnects to the cloud, the cloud can verify these tokens
using the registered node public key
3. Cloud verification of local tokens:
- Check node_id claim
- Look up node's public key
- Verify signature
- Apply stricter authorization (local tokens get fewer permissions)
```
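The cloud-side checks in step 3 can be sketched as a registry lookup followed by a signature check. The names here (`NodeRegistry`, `BridgeClaims`, `BridgeError`, `Permissions`) are hypothetical, the revocation set is an added assumption based on the per-node key revocation described elsewhere in this document, and the cryptographic verification is passed in as a closure so the sketch stays dependency-free; a real implementation would verify the RS256 signature with the registered key.

```rust
use std::collections::{HashMap, HashSet};

/// Claims carried by a bridge token, as listed in the trust model.
#[derive(Debug, Clone)]
pub struct BridgeClaims {
    pub sub: String,
    pub node_id: String,
    pub local_issued: bool,
}

#[derive(Debug, PartialEq)]
pub enum BridgeError {
    NotLocalToken,
    UnknownNode,
    RevokedNode,
    BadSignature,
}

/// Cloud-issued tokens get `Full`; bridge tokens only ever get `Reduced`.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Permissions {
    Full,
    Reduced,
}

/// Node public keys registered during provisioning, plus revoked node IDs.
#[derive(Default)]
pub struct NodeRegistry {
    public_keys: HashMap<String, String>, // node_id -> public key (PEM)
    revoked: HashSet<String>,
}

impl NodeRegistry {
    pub fn register(&mut self, node_id: &str, public_key_pem: &str) {
        self.public_keys
            .insert(node_id.to_string(), public_key_pem.to_string());
    }

    pub fn revoke(&mut self, node_id: &str) {
        self.revoked.insert(node_id.to_string());
    }

    /// `signature_ok` receives the node's registered public key and reports
    /// whether the token's signature verifies against it.
    pub fn verify_bridge(
        &self,
        claims: &BridgeClaims,
        signature_ok: impl Fn(&str) -> bool,
    ) -> Result<Permissions, BridgeError> {
        if !claims.local_issued {
            return Err(BridgeError::NotLocalToken);
        }
        if self.revoked.contains(&claims.node_id) {
            return Err(BridgeError::RevokedNode);
        }
        let key = self
            .public_keys
            .get(&claims.node_id)
            .ok_or(BridgeError::UnknownNode)?;
        if !signature_ok(key.as_str()) {
            return Err(BridgeError::BadSignature);
        }
        Ok(Permissions::Reduced)
    }
}
```

Checking revocation before the signature means a compromised node's tokens are refused even though they verify cryptographically.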
---
## 6. Implementation Considerations
### 6.1 Rust/Axum Backend (Cloud)
**Dependencies:**
```toml
[dependencies]
# Zitadel integration (cloud API)
zitadel = { version = "5", features = ["axum", "oidc", "introspection_cache"] }
# For the RPi5 local node (standalone JWT validation)
jsonwebtoken = "9" # JWT creation and validation
axum-jwt-auth = "0.4" # Axum middleware for JWT with JWKS
# Supporting crates
serde = { version = "1", features = ["derive"] }
serde_json = "1"
reqwest = { version = "0.12", features = ["json"] } # For JWKS fetching
```
**Cloud API: Token validation with Zitadel crate**
```rust
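use axum::{extract::State, routing::get, Router};
use jsonwebtoken::{decode, Algorithm, DecodingKey, Validation};
use zitadel::axum::introspection::{IntrospectedUser, IntrospectionStateBuilder};

// Option A: Use Zitadel's introspection (requires Zitadel to be reachable).
// `username` is optional on `IntrospectedUser`, so it is printed with `{:?}`.
async fn protected_handler(user: IntrospectedUser) -> String {
    format!("Hello, {:?} (id {})!", user.username, user.user_id)
}

// Option B: Use standalone JWKS validation (works offline too)
// This is what the RPi5 uses, but the cloud can use it as well.
// `CachedJwks`, `Claims`, and `AuthError` are PVM-defined types not shown here;
// `AuthError` is assumed to implement `From<jsonwebtoken::errors::Error>`.
fn validate_jwt(token: &str, jwks: &CachedJwks) -> Result<Claims, AuthError> {
    let header = jsonwebtoken::decode_header(token)?;
    let kid = header.kid.ok_or(AuthError::MissingKid)?;
    let key = jwks.get_key(&kid).ok_or(AuthError::UnknownKey)?;
    // Pin issuer and audience; jsonwebtoken v9 rejects a token that carries
    // an `aud` claim when no expected audience is configured.
    let mut validation = Validation::new(Algorithm::RS256);
    validation.set_issuer(&["https://auth.pvm.example.com"]);
    validation.set_audience(&["pvm-api"]);
    let token_data = decode::<Claims>(token, key, &validation)?;
    Ok(token_data.claims)
}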
```
**RPi5 Local Node: Offline JWT validation**
```rust
use std::collections::HashMap;

use chrono::{DateTime, Utc};
use jsonwebtoken::{decode, jwk::JwkSet, Algorithm, DecodingKey, Validation};

// `Result`, `Database`, `Claims`, and `AuthError` are PVM-defined types not shown here.
struct JwksCache {
    keys: HashMap<String, DecodingKey>,
    last_updated: DateTime<Utc>,
}

impl JwksCache {
    /// Load JWKS from libSQL on startup (the query call is schematic, not the exact libSQL API)
    async fn from_libsql(db: &Database) -> Result<Self> {
        let row = db.query("SELECT jwks_json, updated_at FROM jwks_cache ORDER BY updated_at DESC LIMIT 1").await?;
        let jwks: JwkSet = serde_json::from_str(&row.jwks_json)?;
        // Index keys by `kid`; entries without a `kid` or with an unsupported key type are skipped
        let keys = jwks.keys.iter()
            .filter_map(|jwk| {
                let kid = jwk.common.key_id.as_ref()?;
                let key = DecodingKey::from_jwk(jwk).ok()?;
                Some((kid.clone(), key))
            })
            .collect();
        Ok(Self { keys, last_updated: row.updated_at })
    }

    /// Refresh from Zitadel (when online)
    async fn refresh(&mut self, zitadel_url: &str) -> Result<()> {
        // Zitadel serves its JWKS at /oauth/v2/keys (the `jwks_uri` in OIDC discovery)
        let jwks_url = format!("{}/oauth/v2/keys", zitadel_url);
        let jwks: JwkSet = reqwest::get(&jwks_url).await?.error_for_status()?.json().await?;
        // Store in libSQL for offline use (helper not shown)
        self.store_in_libsql(&jwks).await?;
        // Update in-memory cache (helper not shown)
        self.update_keys(&jwks);
        Ok(())
    }

    /// Verify signature and claims using only the cached keys; no network access
    fn validate(&self, token: &str) -> Result<Claims> {
        let header = jsonwebtoken::decode_header(token)?;
        let kid = header.kid.as_ref().ok_or(AuthError::MissingKid)?;
        let key = self.keys.get(kid).ok_or(AuthError::UnknownKey)?;
        let mut validation = Validation::new(Algorithm::RS256);
        validation.set_issuer(&["https://auth.pvm.example.com"]);
        validation.set_audience(&["pvm-api"]);
        let data = decode::<Claims>(token, key, &validation)?;
        Ok(data.claims)
    }
}
```
### 6.2 SvelteKit Frontend
**Dependencies:**
```bash
npm install @auth/sveltekit @auth/core
```
**Auth.js configuration with Zitadel:**
```typescript
// src/auth.ts
import { SvelteKitAuth } from "@auth/sveltekit";
import Zitadel from "@auth/core/providers/zitadel";
import { env } from "$env/dynamic/private"; // server-only environment variables

export const { handle, signIn, signOut } = SvelteKitAuth({
  providers: [
    Zitadel({
      issuer: "https://auth.pvm.example.com",
      clientId: env.ZITADEL_CLIENT_ID,
      clientSecret: env.ZITADEL_CLIENT_SECRET,
      authorization: {
        params: {
          scope: "openid profile email",
        },
      },
    }),
  ],
  callbacks: {
    // Copy the Zitadel tokens into the Auth.js JWT on first sign-in
    async jwt({ token, account }) {
      if (account) {
        token.accessToken = account.access_token;
        token.refreshToken = account.refresh_token;
        token.expiresAt = account.expires_at;
      }
      return token;
    },
    // Expose only the access token to the session
    async session({ session, token }) {
      session.accessToken = token.accessToken;
      return session;
    },
  },
});
```
**Route protection:**
```typescript
// src/routes/venue/+page.server.ts
import { redirect } from "@sveltejs/kit";
import type { PageServerLoad } from "./$types";

export const load: PageServerLoad = async (event) => {
  const session = await event.locals.auth();
  if (!session) {
    throw redirect(303, "/auth/signin");
  }
  return { session };
};
```
**Dual API client (cloud vs. local):**
```typescript
// src/lib/api-client.ts
import { browser } from "$app/environment";

class PvmApiClient {
  private cloudUrl: string;
  private localUrl: string | null = null;

  constructor(cloudUrl: string) {
    this.cloudUrl = cloudUrl;
  }

  // Set when mDNS discovers a local node
  setLocalNode(url: string) {
    this.localUrl = url;
  }

  async fetch(path: string, token: string, options?: RequestInit) {
    // Try local first (lower latency), fall back to cloud
    if (this.localUrl) {
      try {
        const res = await fetch(`${this.localUrl}${path}`, {
          ...options,
          headers: { Authorization: `Bearer ${token}`, ...options?.headers },
          signal: AbortSignal.timeout(2000), // 2s timeout for local
        });
        if (res.ok) return res;
      } catch {
        // Local node unreachable, fall through to cloud
      }
    }
    return fetch(`${this.cloudUrl}${path}`, {
      ...options,
      headers: { Authorization: `Bearer ${token}`, ...options?.headers },
    });
  }
}
```
### 6.3 Zitadel Deployment on Hetzner PVE
**Docker Compose (production):**
```yaml
version: "3.8"
services:
  zitadel:
    image: ghcr.io/zitadel/zitadel:v3-latest
    command: start-from-init --masterkey "${ZITADEL_MASTERKEY}"
    environment:
      ZITADEL_DATABASE_POSTGRES_HOST: postgres
      ZITADEL_DATABASE_POSTGRES_PORT: 5432
      ZITADEL_DATABASE_POSTGRES_DATABASE: zitadel
      ZITADEL_DATABASE_POSTGRES_USER_USERNAME: zitadel
      ZITADEL_DATABASE_POSTGRES_USER_PASSWORD: "${ZITADEL_DB_PASSWORD}"
      ZITADEL_EXTERNALDOMAIN: auth.pvm.example.com
      ZITADEL_EXTERNALPORT: 443
      ZITADEL_EXTERNALSECURE: "true"
      ZITADEL_TLS_MODE: external # TLS terminated at reverse proxy
    ports:
      - "8080:8080"
    depends_on:
      postgres:
        condition: service_healthy
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: "4"
        reservations:
          memory: 512M
          cpus: "1"
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: zitadel
      POSTGRES_USER: zitadel
      POSTGRES_PASSWORD: "${ZITADEL_DB_PASSWORD}"
    volumes:
      - zitadel-pg-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U zitadel"]
      interval: 5s
      timeout: 5s
      retries: 5
    deploy:
      resources:
        limits:
          memory: 1G
          cpus: "2"
volumes:
  zitadel-pg-data:
```
**Estimated resource usage on Hetzner PVE:**
- Zitadel: 512MB-2GB RAM, 1-4 CPU cores
- PostgreSQL (shared or dedicated): 256MB-1GB RAM
- **Total: ~1-3GB RAM** for the auth stack
---
## 7. Security Considerations
### 7.1 Token Security
| Concern | Mitigation |
|---|---|
| Token theft | Short-lived access tokens (15min). Refresh tokens are opaque and stored server-side by Zitadel. |
| Token replay | Include `iat` (issued-at) and `jti` (JWT ID) claims. Local nodes can maintain a small replay cache. |
| Key compromise (cloud) | Zitadel supports key rotation. JWKS cache on RPi5 auto-updates. Revoke compromised keys immediately. |
| Key compromise (node) | Each node has its own key pair. Revoke a single node's key without affecting others. |
| Offline token abuse | "Bridge tokens" issued by local nodes are short-lived (30min) and carry reduced permissions. |
| JWKS cache staleness | 72h maximum offline window. Keys should have longer lifetimes than this. Include previous key versions in cache. |
### 7.2 Social Login Security
- All social OAuth flows terminate at the cloud (Zitadel handles the redirect dance)
- Zitadel validates social provider tokens and issues PVM JWTs
- The local node never sees social provider tokens -- only PVM JWTs
- PKCE is used for all authorization code flows (prevents code interception)
### 7.3 MFA Considerations
- TOTP enrollment happens via Zitadel (cloud)
- TOTP verification can work offline IF the local node has the user's TOTP secret (synced via NATS)
- **Recommendation:** For simplicity, require MFA only for sensitive operations routed to the cloud. Venue check-in at a local node uses standard JWT validation without MFA step-up.
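- Passkeys/FIDO2 run on the authenticator in the user's phone, so the ceremony itself needs no internet. Verifying the assertion still requires a server that holds the registered credential, which in this design is Zitadel (cloud)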
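
The MFA recommendation above amounts to a routing rule per operation. The following is a hedged sketch of that rule: apart from venue check-in, the operation names are hypothetical examples, and refusing sensitive operations while offline is an assumption about how "routed to the cloud" behaves when the cloud is unreachable.

```rust
/// `VenueCheckIn` comes from the recommendation; the other variants are
/// hypothetical examples of sensitive operations.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Operation {
    VenueCheckIn,
    ChangePayoutDetails,
    DeleteAccount,
}

#[derive(Debug, PartialEq)]
pub enum AuthRequirement {
    /// Standard JWT validation; works on the local node, online or offline.
    JwtOnly,
    /// Route to the cloud and require an MFA step-up via Zitadel.
    CloudMfaStepUp,
    /// Sensitive operation requested while the cloud is unreachable (assumed policy: refuse).
    UnavailableOffline,
}

/// Decides what a request needs before it is allowed to proceed.
pub fn requirement(op: Operation, cloud_reachable: bool) -> AuthRequirement {
    let sensitive = !matches!(op, Operation::VenueCheckIn);
    match (sensitive, cloud_reachable) {
        (false, _) => AuthRequirement::JwtOnly,
        (true, true) => AuthRequirement::CloudMfaStepUp,
        (true, false) => AuthRequirement::UnavailableOffline,
    }
}
```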
### 7.4 AGPL License Risk Assessment
| Scenario | Risk |
|---|---|
| Using Zitadel as-is (our case) | Low risk. AGPL's source-offer obligation applies to modified versions; running unmodified Zitadel as a service does not require disclosing PVM's source. |
| Modifying Zitadel source code | Must share modifications under AGPL. Avoid this -- use Zitadel's extension points instead. |
| Linking Zitadel libraries in PVM code | The SDKs are Apache 2.0, so no issue. |
| Distributing Zitadel binary | Must provide source. Not our case -- we self-host only. |
### 7.5 Threat Model for Split-Brain Auth
**Threat:** Attacker compromises an RPi5 node and extracts the JWKS cache.
**Impact:** Low. JWKS contains only public keys. Attacker cannot forge tokens.
**Threat:** Attacker compromises an RPi5 node and extracts the node's private key.
**Impact:** Medium. Attacker can forge "bridge tokens" for that node. Mitigation: revoke the node's key via cloud, notify affected users.
**Threat:** Attacker presents a valid cloud JWT to a local node after the user's account is disabled in the cloud.
**Impact:** Medium. The local node cannot check account status while offline. Mitigation: short token lifetimes (15min), and process account revocations on next NATS sync.
**Threat:** Replay attack with expired token during offline period.
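**Impact:** Low. The JWT `exp` claim is always checked, so expired tokens are rejected for API access regardless of network state. The only use of an expired JWT is as input to the bridge-token flow, which must still verify its signature.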
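
For the disabled-account threat above, "process account revocations on next NATS sync" can be sketched as a deny-list the node consults after JWT validation. This is an illustrative sketch: `RevocationList` and its methods are hypothetical, and it assumes each sync delivers the complete set of disabled user IDs rather than a delta.

```rust
use std::collections::HashSet;

/// Deny-list of disabled accounts, replaced wholesale on each sync.
#[derive(Default)]
pub struct RevocationList {
    disabled: HashSet<String>, // `sub` values of disabled accounts
    last_sync: Option<u64>,    // seconds since the Unix epoch
}

impl RevocationList {
    /// Replaces the deny-list with the set delivered by the latest sync.
    pub fn apply_sync<I: IntoIterator<Item = String>>(&mut self, disabled_subs: I, now: u64) {
        self.disabled = disabled_subs.into_iter().collect();
        self.last_sync = Some(now);
    }

    /// Called after signature and claim checks succeed.
    pub fn is_allowed(&self, sub: &str) -> bool {
        !self.disabled.contains(sub)
    }

    /// Age of the deny-list; `None` if the node has never synced.
    pub fn seconds_since_sync(&self, now: u64) -> Option<u64> {
        self.last_sync.map(|t| now.saturating_sub(t))
    }
}
```

Exposing the age of the list lets the node apply stricter rules, such as refusing bridge tokens, when its revocation data is older than some threshold.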
---
## 8. Alternatives Considered (Detail)
### 8.1 Auth0
- **Free tier:** 25,000 MAU (B2C), 500 MAU (B2B)
- **Pros:** Excellent documentation, many SDKs, built-in social login
- **Cons:** Cloud-only (no self-hosting), no MFA on free tier, expensive at scale ($240/mo for pro), vendor lock-in risks for split-brain architecture. The PVM local node would depend on cached JWKS from Auth0's cloud endpoint -- an Auth0 outage blocks logins, token refresh, and JWKS updates.
- **Verdict:** Vendor dependency is unacceptable for an offline-first architecture.
### 8.2 Clerk
- **Free tier:** 10,000 MAU, 100 organizations
- **Pros:** Great DX, community Rust SDK (`clerk-rs`), community SvelteKit SDK (`svelte-clerk`)
- **Cons:** Cloud-only, session-based (not pure JWT), the Rust SDK is community-maintained with uncertain longevity. No self-hosting option means complete vendor dependency.
- **Verdict:** Cloud-only with session-based auth is fundamentally incompatible with offline local nodes.
### 8.3 Supabase Auth (GoTrue)
- **Free tier:** 50,000 MAU (cloud), unlimited self-hosted
- **Pros:** Simple JWT-based auth, supports RS256, lightweight GoTrue binary
- **Cons:** No Rust SDK, primarily designed as part of the Supabase ecosystem. Self-hosting GoTrue independently requires running it separate from Supabase, which is poorly documented. Limited social provider configuration. No admin UI when self-hosted standalone.
- **Verdict:** Too tightly coupled to the Supabase ecosystem. Could work as a lightweight option but lacks the identity management features PVM needs.
### 8.4 Logto
- **Free tier:** 50,000 MAU (cloud), unlimited self-hosted
- **Pros:** Modern UI, good documentation, OIDC/OAuth 2.1 compliant, RBAC built-in, 20+ social providers
- **Cons:** No Rust SDK (would need to use generic OIDC/JWT crates), Node.js-based (heavier than Go alternatives), relatively young project. SvelteKit support via generic OIDC.
- **Verdict:** Strong contender but loses to Zitadel on Rust integration. If Zitadel didn't have its Rust crate, Logto would be the top pick.
### 8.5 SuperTokens
- **Free tier:** Unlimited self-hosted (open source features), 5,000 MAU cloud
- **Pros:** Self-hosted is fully free, good documentation, session management with anti-CSRF
- **Cons:** No Rust SDK (Node.js, Python, Go only), session-based rather than JWT-focused (would need to run a SuperTokens sidecar), requires SuperTokens core Java service alongside your backend.
- **Verdict:** Session-based model doesn't fit split-brain offline validation. Running a Java core service adds unwanted complexity.
### 8.6 Hanko
- **Free tier:** 10,000 MAU (cloud), unlimited self-hosted (AGPL)
- **Pros:** Passkey-first (great future-proofing), lightweight Go binary, simple API, web components for frontend
- **Cons:** No Rust SDK, limited social login providers compared to Zitadel, smaller community, AGPL license (same as Zitadel). Passkey-first approach may frustrate users who prefer passwords.
- **Verdict:** Interesting for passkey-first apps but too narrow for PVM's diverse auth needs (social login, email+password, phone+password).
### 8.7 Authentik
- **Free tier:** Unlimited self-hosted (open source)
- **Pros:** Full-featured IdP, great admin UI, OIDC/OAuth2/SAML/LDAP/RADIUS support, active development
- **Cons:** No Rust SDK, Python/Django-based (2GB+ RAM minimum), heavier than Go-based alternatives. Designed primarily as a reverse-proxy auth provider for self-hosted services (Plex, Grafana, etc.), not as an embeddable auth API.
- **Verdict:** Excellent for homelab SSO but over-resourced and architecturally mismatched for PVM's API-first needs.
### 8.8 Building Custom Auth in Rust
**Available crates:**
- `jsonwebtoken` (v9) -- JWT creation and validation, RS256/ES256/EdDSA
- `oauth2` -- OAuth2 client flows
- `totp-rs` -- TOTP generation and validation
- `argon2` / `password-auth` -- Password hashing (Argon2id, OWASP recommended params)
- `axum-jwt-auth` -- Axum middleware for JWT with JWKS
- `openidconnect` -- Full OIDC client library
**Estimated effort:** 4-8 weeks for a full auth system with social login, password auth, MFA, session management, account recovery, email verification, and admin UI.
**Risks:**
- Auth is a security-critical system; bugs lead to breaches
- Ongoing maintenance burden (security patches, protocol updates)
- Social login requires implementing OAuth2 flows for each provider
- Account recovery, email verification, and brute force protection all need custom implementation
- No admin UI out of the box
**Verdict:** The Rust ecosystem has excellent building blocks, but assembling them into a production auth system is a 4-8 week effort plus ongoing maintenance, for functionality Zitadel provides out of the box. The split-brain JWT validation part IS worth building custom (it's just `jsonwebtoken` + a JWKS cache), but the full identity management should be delegated to Zitadel.
---
## 9. Final Recommendation & Next Steps
### Architecture Decision
| Component | Solution |
|---|---|
| Identity Provider | Zitadel v3 (self-hosted on Hetzner PVE) |
| Cloud API auth | Zitadel Rust crate (`zitadel::axum`) for introspection OR standalone JWKS validation |
| Local node auth | Custom JWT validation using `jsonwebtoken` crate + cached JWKS |
| Frontend auth | `@auth/sveltekit` with Zitadel OIDC provider |
| JWKS sync | NATS JetStream + periodic HTTP fetch |
| Token format | RS256 JWTs (access), opaque refresh tokens |
| Database | PostgreSQL 16 (shared with or separate from PVM's main database) |
### Implementation Order
1. **Week 1:** Deploy Zitadel on Hetzner PVE (Docker Compose + PostgreSQL). Configure social providers (Google, Apple, Facebook). Set up email+password and phone auth.
2. **Week 2:** Integrate SvelteKit frontend with Zitadel using `@auth/sveltekit`. Build login/signup flows. Test PKCE authorization code flow.
3. **Week 3:** Integrate Cloud Rust API with Zitadel. Use `zitadel::axum` for token validation. Implement user context extraction from JWT claims.
4. **Week 4:** Build JWKS caching on RPi5. Implement offline JWT validation with `jsonwebtoken`. Set up NATS-based JWKS sync. Test offline scenarios.
5. **Week 5:** Implement "bridge token" issuance on RPi5 for offline token refresh. Register node public keys with cloud. Test full split-brain auth flow.
6. **Week 6:** Enable MFA (TOTP). Configure branding and custom login pages. Security review and penetration testing.
### Cost Estimate
| Item | Cost |
|---|---|
| Zitadel (self-hosted) | $0 |
| PostgreSQL (already in stack) | $0 |
| Hetzner resources (incremental) | ~5-10 EUR/month for 2GB RAM + 2 CPU LXC |
| Social login API keys | $0 (Google, Apple, Facebook all free) |
| **Total** | **~5-10 EUR/month** |
---
## Appendix: Sources
- [Zitadel GitHub](https://github.com/zitadel/zitadel)
- [Zitadel Pricing](https://zitadel.com/pricing/detail)
- [Zitadel Rust Crate (docs.rs)](https://docs.rs/zitadel/latest/zitadel/)
- [Zitadel SvelteKit Example](https://github.com/zitadel/example-auth-sveltekit)
- [Zitadel Self-Hosting Specs](https://help.zitadel.com/what-are-zitadel-minimum-self-hosted-specs)
- [Zitadel v3 Announcement (AGPL, PostgreSQL)](https://zitadel.com/blog/zitadel-v3-announcement)
- [Zitadel License FAQ](https://zitadel.com/license-faq)
- [Ory Kratos GitHub](https://github.com/ory/kratos)
- [Ory Hydra GitHub](https://github.com/ory/hydra)
- [Ory Kratos Rust SDK](https://github.com/ory/kratos-client-rust)
- [Keycloak 26.5 Release](https://www.keycloak.org/2026/01/keycloak-2650-released)
- [Keycloak Memory Sizing](https://www.keycloak.org/high-availability/concepts-memory-and-cpu-sizing)
- [Auth0 Pricing](https://auth0.com/pricing)
- [Clerk Pricing](https://clerk.com/pricing)
- [Clerk Rust SDK](https://github.com/DarrenBaldwin07/clerk-rs)
- [Logto Pricing](https://logto.io/pricing)
- [Logto GitHub](https://github.com/logto-io/logto)
- [SuperTokens Pricing](https://supertokens.com/pricing)
- [Hanko GitHub](https://github.com/teamhanko/hanko)
- [Hanko Pricing](https://www.hanko.io/pricing)
- [Authentik GitHub](https://github.com/goauthentik/authentik)
- [Authentik Pricing](https://goauthentik.io/pricing/)
- [Supabase Auth GitHub](https://github.com/supabase/auth)
- [Supabase Pricing](https://supabase.com/pricing)
- [jsonwebtoken Rust Crate](https://github.com/Keats/jsonwebtoken)
- [axum-jwt-auth Crate](https://crates.io/crates/axum-jwt-auth)
- [Edge JWT Validation Patterns](https://securityboulevard.com/2025/11/how-to-validate-jwts-efficiently-at-the-edge-with-cloudflare-workers-and-vercel/)