Add auth framework research document
Comprehensive evaluation of 11 auth frameworks for PVM's split-brain architecture. Recommends self-hosted Zitadel v3 for its Rust crate, OIDC JWKS for offline JWT validation on RPi5 nodes, and zero-cost self-hosting on existing Hetzner PVE.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent
995a8123e6
commit
e25afdcb3a
1 changed files with 805 additions and 0 deletions
805
docs/AUTH_RESEARCH.md
Normal file
@ -0,0 +1,805 @@
# PVM Authentication Framework Research

> **Date:** 2025-02-08
> **Status:** Final
> **Author:** Research Agent (Claude)

---
|
||||
|
||||
## 1. Executive Summary
|
||||
|
||||
**Recommendation: Zitadel (self-hosted) + lightweight JWT validation on local nodes.**
|
||||
|
||||
After evaluating 11 authentication frameworks against PVM's unique split-brain architecture requirements, Zitadel emerges as the clear winner for these reasons:
|
||||
|
||||
1. **Official Rust/Axum crate** (`zitadel` on crates.io) with dedicated Axum middleware, introspection, and OIDC modules -- no other auth platform has this level of first-class Rust support.
2. **Official SvelteKit integration** via Auth.js with documented PKCE flow, maintained by the Zitadel team.
3. **Self-hosted on PostgreSQL** (v3+ requires PostgreSQL, dropping CockroachDB) -- PVM already uses PostgreSQL 16+, so Zitadel shares the same database engine with zero additional database infrastructure.
4. **Standard OIDC/OAuth2 with JWKS endpoint** -- the RPi5 local nodes cache the JWKS public keys and validate JWTs entirely offline. No auth server needed on the Pi.
5. **AGPL v3 license** -- fine for PVM since we use Zitadel as-is (not modifying its source code), and it runs as an independent service.
6. **Resource-efficient** -- runs on 512MB RAM + 1 CPU for test environments, 1-2GB RAM + 2-4 CPUs for production. Fits comfortably on Hetzner PVE.
7. **Full feature coverage** -- social login (Google, Apple, Facebook), email+password, phone+password, TOTP/MFA, passkeys, magic links, RBAC, admin console, audit logs.
8. **Free forever when self-hosted** -- no MAU limits, no feature gating on the self-hosted version.
|
||||
|
||||
**Runner-up: Ory (Kratos + Hydra)** -- more flexible but significantly more complex to operate (two services, custom UI required, manual integration between components).
|
||||
|
||||
**Third place: Keycloak** -- battle-tested but Java-based and resource-heavy (roughly 1.25GB RAM baseline, more than Zitadel needs for equivalent workloads), with no Rust SDK.
|
||||
|
||||
---
|
||||
|
||||
## 2. The Split-Brain Auth Challenge
|
||||
|
||||
### The Problem
|
||||
|
||||
PVM has a distributed architecture where a player's phone can talk to either:
|
||||
- **The cloud** (Hetzner PVE) -- the primary SaaS backend
|
||||
- **A local RPi5 node** at a poker venue -- for low latency and offline resilience
|
||||
|
||||
The local node may be offline for up to 72 hours. When online, it syncs via NATS JetStream. This creates a fundamental auth challenge:
|
||||
|
||||
```
|
||||
Player Phone
|
||||
|
|
||||
|-- (mDNS discovery) --> RPi5 Local Node (may be offline)
|
||||
|
|
||||
|-- (internet) -------> Cloud SaaS (Hetzner PVE)
|
||||
```
|
||||
|
||||
**Tokens issued by the cloud must validate on the local node, and any token a local node issues while offline must later validate in the cloud**, without the local node calling home to verify them.
|
||||
|
||||
### The Solution Pattern
|
||||
|
||||
The only viable approach for offline token validation is **asymmetric JWT signing with cached JWKS**:
|
||||
|
||||
1. **Zitadel runs on the cloud** (Hetzner PVE), issuing JWTs signed with RS256 (RSA) or ES256 (ECDSA) private keys.
|
||||
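2. **The JWKS (public keys) are published** at the `jwks_uri` advertised in the OIDC discovery document (`/.well-known/openid-configuration`). On Zitadel the keys are served at `/oauth/v2/keys` rather than at a `/.well-known/jwks.json` path.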
|
||||
3. **Each RPi5 node caches the JWKS** when it syncs with the cloud. The cache is refreshed on every NATS sync cycle.
|
||||
4. **When offline, the RPi5 validates JWTs** using only the cached public keys -- pure cryptographic verification, no network calls.
|
||||
5. **Token refresh** happens against whichever endpoint is reachable (cloud or local). The local node can issue short-lived tokens that are also verifiable by the cloud (using the same or a federated key trust).
|
||||
|
||||
### Key Design Decisions
|
||||
|
||||
| Decision | Choice | Rationale |
|---|---|---|
| Signing algorithm | RS256 or ES256 | Asymmetric -- public key can be distributed freely |
| Token format | JWT (access token) + opaque refresh token | JWTs are self-contained and verifiable offline |
| JWKS caching | On RPi5 via NATS sync | Keeps offline validation working for the full 72h window |
| Token lifetime | Access: 15min, Refresh: 7 days | Short access tokens limit blast radius; refresh tokens cover offline periods |
| Auth server location | Cloud only | RPi5 validates JWTs and, while offline, issues only short-lived bridge tokens (Section 5); it never runs the full auth server |
| Social login | Cloud only (OAuth requires internet) | Cloud issues PVM JWT after social auth completes |
|
||||
|
||||
---
|
||||
|
||||
## 3. Evaluation Matrix
|
||||
|
||||
### Scoring Key
|
||||
- **A** = Excellent fit
|
||||
- **B** = Good fit with minor gaps
|
||||
- **C** = Usable but significant caveats
|
||||
- **D** = Poor fit
|
||||
- **F** = Does not work
|
||||
|
||||
| Framework | Rust SDK | SvelteKit | Self-Hosted | Free Tier | Social Login | MFA/2FA | JWT/JWKS | Resource Needs | Split-Brain Fit | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| **Zitadel** | A (official crate) | A (Auth.js example) | A (Docker+PG) | A (unlimited self-hosted) | A (Google/Apple/FB+) | A (TOTP, passkeys) | A (standard OIDC JWKS) | B (512MB-2GB) | A | **A** |
| **Ory (Kratos+Hydra)** | B (auto-gen SDK) | B (community kit) | A (Go binaries) | A (fully OSS) | A (via Kratos) | A (TOTP, WebAuthn) | A (Hydra JWKS) | A (lightweight Go) | A | **B+** |
| **Keycloak** | D (no SDK, REST API) | B (OIDC generic) | A (Docker) | A (fully OSS) | A (built-in) | A (TOTP, WebAuthn) | A (JWKS) | C (1.25GB+ RAM, Java) | A | **B** |
| **Logto** | D (no Rust SDK) | B (OIDC generic) | A (Docker, Node.js) | A (unlimited self-hosted) | A (20+ providers) | A (TOTP, passkeys) | A (OIDC JWKS) | B (~512MB-1GB) | A | **B** |
| **Authentik** | D (no SDK, REST/OIDC) | B (OIDC generic) | A (Docker) | A (fully OSS) | A (broad) | A (TOTP, WebAuthn) | A (OIDC JWKS) | C (2GB+ RAM, Python) | A | **B-** |
| **Auth0** | D (no SDK) | B (Auth.js) | F (cloud only) | B (25k MAU free) | A (built-in) | C (paid only) | A (JWKS) | N/A (managed) | B (vendor dep.) | **C+** |
| **Clerk** | C (community crate) | B (community svelte-clerk) | F (cloud only) | B (10k MAU free) | A (built-in) | A (built-in) | B (session tokens) | N/A (managed) | C (cloud-dependent) | **C** |
| **Supabase Auth** | D (no Rust SDK) | C (JS client) | B (GoTrue Docker) | B (50k MAU cloud) | A (built-in) | B (limited) | B (RS256 JWTs) | B (~512MB) | B (GoTrue only) | **C** |
| **SuperTokens** | D (no Rust SDK) | C (React SDK focus) | A (Docker core) | A (unlimited self-hosted) | A (built-in) | A (TOTP) | C (session-based, not JWT) | B (~1GB) | C (session model) | **C** |
| **Hanko** | D (no Rust SDK) | B (web components) | A (Docker, Go) | B (10k MAU cloud) | B (limited providers) | A (passkeys native) | B (OIDC) | A (lightweight Go) | B | **C+** |
| **Custom (Rust)** | A (own code) | A (own design) | A (embedded) | A (free) | C (build OAuth flows) | C (build TOTP) | A (jsonwebtoken crate) | A (no overhead) | A | **B-** |
|
||||
|
||||
---
|
||||
|
||||
## 4. Deep Dive: Top 3 Candidates
|
||||
|
||||
### 4.1 Zitadel (Recommended)
|
||||
|
||||
**What it is:** A cloud-native identity management platform written in Go, providing full OIDC/OAuth2, SAML, and LDAP support with a built-in admin console.
|
||||
|
||||
**Why it wins for PVM:**
|
||||
|
||||
**Rust Integration (Best-in-Class)**
|
||||
The `zitadel` crate (v5.5+) provides:
|
||||
- `zitadel::axum` module with middleware for token introspection
|
||||
- `zitadel::oidc` for OpenID Connect discovery and token validation
|
||||
- `introspection_cache` feature flag for caching OIDC discovery and introspection results
|
||||
- Feature flags: `axum`, `oidc`, `credentials`, `api`, `introspection_cache`
|
||||
|
||||
```toml
|
||||
# Cargo.toml
|
||||
[dependencies]
|
||||
zitadel = { version = "5", features = ["axum", "oidc", "introspection_cache"] }
|
||||
```
|
||||
|
||||
**SvelteKit Integration (Official)**
|
||||
Zitadel maintains an [official example](https://github.com/zitadel/example-auth-sveltekit) using `@auth/sveltekit` with:
|
||||
- PKCE authorization code flow
|
||||
- Automatic token refresh
|
||||
- Server-side session management via SvelteKit load functions
|
||||
- Federated logout with CSRF protection
|
||||
|
||||
**Self-Hosting (Simple)**
|
||||
- Single Go binary + PostgreSQL (PVM already has PG 16+)
|
||||
- Docker Compose deployment in minutes
|
||||
- v3+ requires PostgreSQL only (dropped CockroachDB)
|
||||
- Resource needs: 512MB RAM (test), 1-2GB RAM + 2-4 CPUs (production)
|
||||
|
||||
**Feature Completeness:**
|
||||
- Social login: Google, Apple, Facebook, GitHub, GitLab, Microsoft, and more
|
||||
- Email + password with customizable policies
|
||||
- Phone number authentication
|
||||
- TOTP, passkeys/FIDO2, email/SMS OTP
|
||||
- Magic links / passwordless
|
||||
- Built-in admin console (web UI)
|
||||
- Multi-tenancy with organizations
|
||||
- RBAC with roles and permissions
|
||||
- Unlimited audit trail
|
||||
- Branding and custom login pages
|
||||
- Account linking across providers
|
||||
|
||||
**Licensing:**
|
||||
- AGPL v3 as of Zitadel v3 (March 2025)
|
||||
- Using Zitadel as an identity service without modifying its source code is fine for commercial use
|
||||
- SDKs and Protocol Buffer definitions remain Apache 2.0
|
||||
- A commercial license is available if AGPL is incompatible
|
||||
|
||||
**Limitations:**
|
||||
- AGPL may concern some organizations (but not PVM's use case)
|
||||
- The Rust crate's introspection module requires network access to Zitadel for token introspection (but we use JWKS validation instead on the RPi5, which is offline-capable)
|
||||
- Resource usage spikes during password hashing (4 CPU cores recommended for production)
|
||||
|
||||
---
|
||||
|
||||
### 4.2 Ory (Kratos + Hydra) -- Runner-Up
|
||||
|
||||
**What it is:** A suite of Go microservices -- Kratos for identity management, Hydra for OAuth2/OIDC, Keto for permissions, Oathkeeper for API gateway auth.
|
||||
|
||||
**Why it's strong:**
|
||||
- Written in Go, lightweight binaries (5-15MB each), low resource usage
|
||||
- Kratos handles registration, login, MFA, social login, account recovery
|
||||
- Hydra is OpenID Certified and handles OAuth2 + JWKS endpoint
|
||||
- Auto-generated Rust SDK for both Kratos and Hydra APIs
|
||||
- Fully open source (Apache 2.0 license)
|
||||
- Scales to billions of users according to Ory, which cites OpenAI as a user
|
||||
|
||||
**Why it loses to Zitadel for PVM:**
|
||||
- **Operational complexity:** You need to run Kratos AND Hydra as separate services, configure them to work together, and build a custom login/consent UI. This is significant engineering overhead.
|
||||
- **No built-in admin UI:** You must build or find a third-party admin interface.
|
||||
- **SvelteKit integration:** Only community examples exist (ory-kit by MarkusThielker), and development on the SvelteKit UI has stopped.
|
||||
- **Rust SDK is auto-generated:** Works but lacks the ergonomics and Axum-specific middleware of Zitadel's crate.
|
||||
- **Documentation complexity:** Setting up Kratos + Hydra together requires deep understanding of OAuth2 flows and significant configuration.
|
||||
|
||||
**Resource requirements:** Very lightweight. Kratos idles at ~15-380MB depending on configuration. Hydra is similarly lean. Total for both: 256MB-1GB RAM.
|
||||
|
||||
**Best for:** Teams that want maximum flexibility and are willing to invest in custom UI development and operational complexity.
|
||||
|
||||
---
|
||||
|
||||
### 4.3 Keycloak -- Third Place
|
||||
|
||||
**What it is:** The industry-standard open-source identity management platform, backed by Red Hat/JBoss, written in Java.
|
||||
|
||||
**Why it's considered:**
|
||||
- Most battle-tested solution in the market (used by thousands of enterprises)
|
||||
- Full OIDC/OAuth2/SAML support with standard JWKS endpoints
|
||||
- Built-in admin console, user management, social login, MFA
|
||||
- Extensive documentation and community
|
||||
- FAPI 2.0 compliant (Keycloak 26.4+)
|
||||
- JWT Authorization Grant (RFC 7523) in Keycloak 26.5
|
||||
|
||||
**Why it loses for PVM:**
|
||||
- **Java-based, resource-heavy:** About 750MB RAM minimum for a bare container and roughly 1.25GB once caches are included; 2GB recommended for production. PVM's Hetzner PVE resources are better spent elsewhere.
|
||||
- **No Rust SDK:** You'd use generic OIDC/JWT validation crates. The REST admin API works but has no Rust client.
|
||||
- **Slower startup:** Java cold starts are measured in seconds, not milliseconds.
|
||||
- **Overkill for PVM:** Enterprise features like SAML, LDAP federation, and Kerberos add complexity without value for a poker venue SaaS.
|
||||
- **Theme customization:** Uses FreeMarker templates, which have a steep learning curve.
|
||||
|
||||
**Resource requirements:** 1.25GB RAM base (including caches), recommended 2GB+ for production. 1-2 CPU cores minimum.
|
||||
|
||||
**Best for:** Enterprises with existing Java infrastructure that need SAML/LDAP federation.
|
||||
|
||||
---
|
||||
|
||||
## 5. Recommended Architecture
|
||||
|
||||
### Overview
|
||||
|
||||
```
|
||||
CLOUD (Hetzner PVE)
|
||||
┌──────────────────────────────────────────┐
|
||||
│ │
|
||||
│ ┌─────────┐ ┌──────────────────┐ │
|
||||
│ │ Zitadel │ │ PVM Cloud API │ │
|
||||
│ │ (Auth) │◄───►│ (Rust/Axum) │ │
|
||||
│ │ │ │ │ │
|
||||
│ │ PG DB │ │ PG DB │ │
|
||||
│ └────┬────┘ └────────┬─────────┘ │
|
||||
│ │ │ │
|
||||
│ │ JWKS endpoint │ NATS │
|
||||
│ │ /.well-known/ │ JetStream │
|
||||
│ │ jwks.json │ │
|
||||
└───────┼───────────────────┼──────────────┘
|
||||
│ │
|
||||
│ │
|
||||
════════╪═══════════════════╪═══════ INTERNET
|
||||
│ │
|
||||
▼ ▼
|
||||
┌──────────────────────────────────────────┐
|
||||
│ RPi5 LOCAL NODE │
|
||||
│ │
|
||||
│ ┌──────────────┐ ┌─────────────────┐ │
|
||||
│ │ Cached JWKS │ │ PVM Local API │ │
|
||||
│ │ (public keys)│◄─│ (Rust binary) │ │
|
||||
│ │ │ │ │ │
|
||||
│ │ Updated via │ │ libSQL DB │ │
|
||||
│ │ NATS sync │ │ │ │
|
||||
│ └──────────────┘ │ NATS leaf node │ │
|
||||
│ └─────────────────┘ │
|
||||
└──────────────────────────────────────────┘
|
||||
▲
|
||||
│ mDNS discovery + local API calls
|
||||
│
|
||||
┌───────┴──────┐
|
||||
│ Player Phone │
|
||||
│ (SvelteKit) │
|
||||
└──────────────┘
|
||||
```
|
||||
|
||||
### Auth Flow: Registration & Login (Cloud)
|
||||
|
||||
```
|
||||
1. Player opens PVM app (SvelteKit)
|
||||
2. App detects network connectivity --> routes to cloud
|
||||
3. Player chooses: email+password, phone+password, or social login (Google/Apple/Facebook)
|
||||
4. SvelteKit redirects to Zitadel login page (OIDC Authorization Code + PKCE)
|
||||
5. Zitadel handles the auth flow (including social OAuth if applicable)
|
||||
6. Zitadel issues:
|
||||
- Access token (JWT, signed RS256, 15min expiry)
|
||||
- Refresh token (opaque, 7-day expiry)
|
||||
- ID token (JWT with user claims)
|
||||
7. SvelteKit stores tokens (httpOnly cookies for SSR, secure storage for SPA)
|
||||
8. Cloud API validates JWT on each request using Zitadel's JWKS
|
||||
```
|
||||
|
||||
### Auth Flow: Local Node (Offline-Capable)
|
||||
|
||||
```
|
||||
1. Player phone discovers RPi5 via mDNS
|
||||
2. Phone sends request to local API with existing JWT (from cloud login)
|
||||
3. RPi5 Rust binary validates JWT:
|
||||
a. Parse JWT header to get key ID (kid)
|
||||
b. Look up public key in cached JWKS (stored in libSQL or memory)
|
||||
c. Verify RS256 signature
|
||||
d. Validate claims (exp, iss, aud, sub)
|
||||
4. If JWT is expired but refresh token is available:
|
||||
a. If cloud is reachable: refresh against Zitadel
|
||||
b. If offline: issue a short-lived local token (signed with the node's key)
|
||||
- The cloud trusts the node's public key (registered during provisioning)
|
||||
5. Request is authenticated; proceed with venue operations
|
||||
```
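
Step 3d of the flow above can be sketched without any JWT library, assuming the signature has already been verified. This is a minimal illustration rather than PVM's implementation: the `Claims` struct, the `ClaimError` enum, `validate_claims`, and the `leeway_secs` parameter are all names assumed here, and timestamps are plain Unix seconds so the rules stay easy to test.

```rust
/// Claims the local node inspects after signature verification.
/// Field names follow RFC 7519; the struct itself is illustrative.
#[derive(Debug, Clone)]
pub struct Claims {
    pub iss: String,
    pub aud: Vec<String>,
    pub sub: String,
    pub exp: u64, // seconds since the Unix epoch
}

#[derive(Debug, PartialEq)]
pub enum ClaimError {
    Expired,
    WrongIssuer,
    WrongAudience,
    MissingSubject,
}

/// Validate exp, iss, aud, and sub. `leeway_secs` tolerates clock drift,
/// which can matter on a node that has been offline for a while.
pub fn validate_claims(
    claims: &Claims,
    now: u64,
    expected_iss: &str,
    expected_aud: &str,
    leeway_secs: u64,
) -> Result<(), ClaimError> {
    // Reject once `now` reaches exp plus the allowed leeway.
    if claims.exp.saturating_add(leeway_secs) <= now {
        return Err(ClaimError::Expired);
    }
    if claims.iss != expected_iss {
        return Err(ClaimError::WrongIssuer);
    }
    if !claims.aud.iter().any(|a| a == expected_aud) {
        return Err(ClaimError::WrongAudience);
    }
    if claims.sub.is_empty() {
        return Err(ClaimError::MissingSubject);
    }
    Ok(())
}
```

In a real node the `jsonwebtoken` crate's `Validation` type performs equivalent expiry, issuer, and audience checks; the sketch only makes the rules explicit.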
|
||||
|
||||
### Auth Flow: Token Refresh Strategy
|
||||
|
||||
```
|
||||
Token Refresh Decision Tree:
|
||||
│
|
||||
├── Cloud reachable?
|
||||
│ ├── YES: Refresh against Zitadel (standard OIDC refresh)
|
||||
│ │ └── New access token (15min) + new refresh token (7 days)
|
||||
│ │
|
||||
│ └── NO: Is the refresh token still valid (< 7 days)?
|
||||
│ ├── YES: Local node issues a "bridge token"
|
||||
│ │ - Signed with node's key pair
|
||||
│ │ - Short-lived (30 min)
|
||||
│ │ - Contains original user claims from the expired JWT
|
||||
│ │ - Marked with a "local_issued" claim
|
||||
│ │
|
||||
│ └── NO: User must re-authenticate when cloud is reachable
|
||||
│ (graceful degradation -- show "offline mode limited")
|
||||
```
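
The decision tree above maps onto a small pure function. The sketch below is one possible encoding under stated assumptions: `RefreshAction`, `decide_refresh`, and the two constants are hypothetical names, and the function takes the refresh token's age as an input without saying how a node would learn it for an opaque token.

```rust
use std::time::Duration;

/// Outcome of the refresh decision tree.
#[derive(Debug, PartialEq)]
pub enum RefreshAction {
    /// Standard OIDC refresh against Zitadel.
    RefreshAtCloud,
    /// The node signs a short-lived bridge token with its own key.
    IssueBridgeToken { lifetime: Duration },
    /// Refresh window has lapsed; the user logs in again once the cloud is reachable.
    RequireReauth,
}

const REFRESH_WINDOW: Duration = Duration::from_secs(7 * 24 * 60 * 60); // 7 days
const BRIDGE_LIFETIME: Duration = Duration::from_secs(30 * 60); // 30 minutes

/// `refresh_token_age` is the time since the refresh token was issued.
/// How the node obtains that value is an open design point, not shown here.
pub fn decide_refresh(cloud_reachable: bool, refresh_token_age: Duration) -> RefreshAction {
    if cloud_reachable {
        RefreshAction::RefreshAtCloud
    } else if refresh_token_age < REFRESH_WINDOW {
        RefreshAction::IssueBridgeToken { lifetime: BRIDGE_LIFETIME }
    } else {
        RefreshAction::RequireReauth
    }
}
```

Keeping the decision separate from signing and network I/O lets the policy be unit-tested on its own.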
|
||||
|
||||
### JWKS Sync Strategy
|
||||
|
||||
```
|
||||
1. On RPi5 boot / NATS reconnect:
|
||||
- Fetch JWKS from Zitadel's jwks_uri (/oauth/v2/keys)
|
||||
- Store in libSQL (jwks table) with timestamp
|
||||
- Also cache in memory (HashMap<kid, DecodingKey>)
|
||||
|
||||
2. Periodic refresh (every 1 hour while connected):
|
||||
- Re-fetch JWKS
|
||||
- Compare with cached version
|
||||
- Update if changed (key rotation support)
|
||||
|
||||
3. Via NATS JetStream:
|
||||
- Cloud publishes "jwks.updated" event on key rotation
|
||||
- RPi5 subscribes and refreshes immediately
|
||||
|
||||
4. Offline fallback:
|
||||
- Use last cached JWKS (stored in libSQL)
|
||||
- Valid for up to 72 hours (matches offline window)
|
||||
- Include 2-3 previous key versions to handle rotation during offline period
|
||||
```
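
The retention and staleness rules in step 4 could look roughly like the following. This is a sketch under assumptions, not the actual cache: `JwksStore`, `KeySet`, and both constants are hypothetical, key material is kept as opaque bytes rather than parsed keys, and time is passed in as Unix seconds.

```rust
use std::collections::HashMap;

const MAX_OFFLINE_SECS: u64 = 72 * 60 * 60; // matches the 72h offline window
const RETAINED_GENERATIONS: usize = 3; // current key set plus two previous ones

/// One fetched JWKS, reduced to kid -> raw key material.
pub type KeySet = HashMap<String, Vec<u8>>;

#[derive(Default)]
pub struct JwksStore {
    generations: Vec<KeySet>, // newest first
    fetched_at: u64,          // Unix seconds of the last successful fetch
}

impl JwksStore {
    /// Record a fresh fetch. Returns true if the key set changed (rotation).
    pub fn update(&mut self, fetched: KeySet, now: u64) -> bool {
        self.fetched_at = now;
        if self.generations.first() == Some(&fetched) {
            return false;
        }
        self.generations.insert(0, fetched);
        self.generations.truncate(RETAINED_GENERATIONS);
        true
    }

    /// Look up a key by kid across retained generations, newest first.
    /// Returns None once the cache is older than the offline window.
    pub fn lookup(&self, kid: &str, now: u64) -> Option<&[u8]> {
        if now.saturating_sub(self.fetched_at) > MAX_OFFLINE_SECS {
            return None;
        }
        self.generations
            .iter()
            .find_map(|set| set.get(kid))
            .map(|key| key.as_slice())
    }
}
```

Searching older generations lets a token signed just before a rotation still validate while the node is offline.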
|
||||
|
||||
### Node Key Trust Model
|
||||
|
||||
```
|
||||
1. RPi5 provisioning:
|
||||
- Node generates its own RSA key pair (used for RS256 signing) on first boot
|
||||
- Public key is registered with the cloud PVM API via NATS
|
||||
- Cloud stores node public keys in its database
|
||||
|
||||
2. Local token issuance (offline refresh):
|
||||
- Node signs "bridge tokens" with its private key
|
||||
- Token includes: original user sub, node_id, "local_issued" flag
|
||||
- When cloud comes back online, it can verify these tokens
|
||||
using the registered node public key
|
||||
|
||||
3. Cloud verification of local tokens:
|
||||
- Check node_id claim
|
||||
- Look up node's public key
|
||||
- Verify signature
|
||||
- Apply stricter authorization (local tokens get fewer permissions)
|
||||
```
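
The cloud-side checks in step 3 can be sketched as two small functions. Everything named here is an assumption for illustration: `BridgeClaims`, `NodeRecord`, `resolve_node_key`, `restrict_permissions`, and the permission strings in `BRIDGE_ALLOWED` are hypothetical, and signature verification itself is left to a JWT library using the key this code returns.

```rust
use std::collections::HashMap;

/// Claims carried by a bridge token (field names are illustrative).
pub struct BridgeClaims {
    pub sub: String,
    pub node_id: String,
    pub local_issued: bool,
}

/// What the cloud stores about a provisioned node.
pub struct NodeRecord {
    pub public_key_pem: String,
    pub revoked: bool,
}

#[derive(Debug, PartialEq)]
pub enum BridgeError {
    NotLocalIssued,
    UnknownNode,
    NodeRevoked,
}

/// Check the node_id claim and return the node's registered public key.
pub fn resolve_node_key<'a>(
    claims: &BridgeClaims,
    nodes: &'a HashMap<String, NodeRecord>,
) -> Result<&'a str, BridgeError> {
    if !claims.local_issued {
        return Err(BridgeError::NotLocalIssued);
    }
    let node = nodes.get(&claims.node_id).ok_or(BridgeError::UnknownNode)?;
    if node.revoked {
        return Err(BridgeError::NodeRevoked);
    }
    Ok(node.public_key_pem.as_str())
}

/// Permissions a locally issued token may carry (hypothetical names).
const BRIDGE_ALLOWED: [&str; 2] = ["venue:checkin", "table:view"];

/// Stricter authorization: keep only permissions allowed for bridge tokens.
pub fn restrict_permissions(requested: &[&str]) -> Vec<String> {
    requested
        .iter()
        .filter(|p| BRIDGE_ALLOWED.iter().any(|a| *a == **p))
        .map(|p| (*p).to_string())
        .collect()
}
```

A revoked flag on the node record is one way to support revoking a single node's key without touching the others.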
|
||||
|
||||
---
|
||||
|
||||
## 6. Implementation Considerations
|
||||
|
||||
### 6.1 Rust/Axum Backend (Cloud)
|
||||
|
||||
**Dependencies:**
|
||||
|
||||
```toml
|
||||
[dependencies]
|
||||
# Zitadel integration (cloud API)
|
||||
zitadel = { version = "5", features = ["axum", "oidc", "introspection_cache"] }
|
||||
|
||||
# For the RPi5 local node (standalone JWT validation)
|
||||
jsonwebtoken = "9" # JWT creation and validation
|
||||
axum-jwt-auth = "0.4" # Axum middleware for JWT with JWKS
|
||||
|
||||
# Supporting crates
|
||||
serde = { version = "1", features = ["derive"] }
|
||||
serde_json = "1"
|
||||
reqwest = { version = "0.12", features = ["json"] } # For JWKS fetching
|
||||
```
|
||||
|
||||
**Cloud API: Token validation with Zitadel crate**
|
||||
|
||||
```rust
use std::collections::HashMap;

use jsonwebtoken::{decode, decode_header, errors::Error as JwtError, Algorithm, DecodingKey, Validation};
use serde::Deserialize;
use zitadel::axum::introspection::IntrospectedUser;

// Option A: Use Zitadel's introspection (requires Zitadel to be reachable).
// The extractor needs an introspection state (see IntrospectionStateBuilder) on the router.
async fn protected_handler(user: IntrospectedUser) -> String {
    format!("Hello, {}!", user.user_id)
}

// Option B: Use standalone JWKS validation (works offline too)
// This is what the RPi5 uses, but the cloud can use it as well
#[derive(Debug, Deserialize)]
struct Claims {
    sub: String,
    exp: usize,
}

#[derive(Debug)]
enum AuthError {
    MissingKid,
    UnknownKey,
    Jwt(JwtError),
}

// Lets `?` convert jsonwebtoken errors into AuthError.
impl From<JwtError> for AuthError {
    fn from(err: JwtError) -> Self {
        AuthError::Jwt(err)
    }
}

// Decoding keys indexed by key ID (`kid`), filled from the cached JWKS.
type CachedJwks = HashMap<String, DecodingKey>;

fn validate_jwt(token: &str, jwks: &CachedJwks) -> Result<Claims, AuthError> {
    let header = decode_header(token)?;
    let kid = header.kid.ok_or(AuthError::MissingKid)?;
    let key = jwks.get(&kid).ok_or(AuthError::UnknownKey)?;
    let mut validation = Validation::new(Algorithm::RS256);
    validation.set_issuer(&["https://auth.pvm.example.com"]);
    validation.set_audience(&["pvm-api"]);
    let token_data = decode::<Claims>(token, key, &validation)?;
    Ok(token_data.claims)
}
```
|
||||
|
||||
**RPi5 Local Node: Offline JWT validation**
|
||||
|
||||
```rust
|
||||
use std::collections::HashMap;

use chrono::{DateTime, Utc};
use jsonwebtoken::{decode, DecodingKey, Validation, Algorithm, jwk::JwkSet};
|
||||
|
||||
struct JwksCache {
|
||||
keys: HashMap<String, DecodingKey>,
|
||||
last_updated: DateTime<Utc>,
|
||||
}
|
||||
|
||||
impl JwksCache {
|
||||
/// Load JWKS from libSQL on startup
|
||||
async fn from_libsql(db: &Database) -> Result<Self> {
|
||||
let row = db.query("SELECT jwks_json, updated_at FROM jwks_cache ORDER BY updated_at DESC LIMIT 1").await?;
|
||||
let jwks: JwkSet = serde_json::from_str(&row.jwks_json)?;
|
||||
let keys = jwks.keys.iter()
|
||||
.filter_map(|jwk| {
|
||||
let kid = jwk.common.key_id.as_ref()?;
|
||||
let key = DecodingKey::from_jwk(jwk).ok()?;
|
||||
Some((kid.clone(), key))
|
||||
})
|
||||
.collect();
|
||||
Ok(Self { keys, last_updated: row.updated_at })
|
||||
}
|
||||
|
||||
/// Refresh from Zitadel (when online)
|
||||
async fn refresh(&mut self, zitadel_url: &str) -> Result<()> {
|
||||
// Zitadel serves its signing keys at /oauth/v2/keys (the jwks_uri in OIDC discovery)
let jwks_url = format!("{}/oauth/v2/keys", zitadel_url);
|
||||
let jwks: JwkSet = reqwest::get(&jwks_url).await?.json().await?;
|
||||
// Store in libSQL for offline use
|
||||
self.store_in_libsql(&jwks).await?;
|
||||
// Update in-memory cache
|
||||
self.update_keys(&jwks);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn validate(&self, token: &str) -> Result<Claims> {
|
||||
let header = jsonwebtoken::decode_header(token)?;
|
||||
let kid = header.kid.as_ref().ok_or(AuthError::MissingKid)?;
|
||||
let key = self.keys.get(kid).ok_or(AuthError::UnknownKey)?;
|
||||
|
||||
let mut validation = Validation::new(Algorithm::RS256);
|
||||
validation.set_issuer(&["https://auth.pvm.example.com"]);
|
||||
validation.set_audience(&["pvm-api"]);
|
||||
|
||||
let data = decode::<Claims>(token, key, &validation)?;
|
||||
Ok(data.claims)
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 6.2 SvelteKit Frontend
|
||||
|
||||
**Dependencies:**
|
||||
|
||||
```bash
|
||||
npm install @auth/sveltekit @auth/core
|
||||
```
|
||||
|
||||
**Auth.js configuration with Zitadel:**
|
||||
|
||||
```typescript
|
||||
// src/auth.ts
|
||||
import { SvelteKitAuth } from "@auth/sveltekit";
|
||||
import Zitadel from "@auth/core/providers/zitadel";
import { env } from "$env/dynamic/private";
|
||||
|
||||
export const { handle, signIn, signOut } = SvelteKitAuth({
|
||||
providers: [
|
||||
Zitadel({
|
||||
issuer: "https://auth.pvm.example.com",
|
||||
clientId: env.ZITADEL_CLIENT_ID,
|
||||
clientSecret: env.ZITADEL_CLIENT_SECRET,
|
||||
authorization: {
|
||||
params: {
|
||||
scope: "openid profile email",
|
||||
},
|
||||
},
|
||||
}),
|
||||
],
|
||||
callbacks: {
|
||||
async jwt({ token, account }) {
|
||||
if (account) {
|
||||
token.accessToken = account.access_token;
|
||||
token.refreshToken = account.refresh_token;
|
||||
token.expiresAt = account.expires_at;
|
||||
}
|
||||
return token;
|
||||
},
|
||||
async session({ session, token }) {
|
||||
session.accessToken = token.accessToken;
|
||||
return session;
|
||||
},
|
||||
},
|
||||
});
|
||||
```
|
||||
|
||||
**Route protection:**
|
||||
|
||||
```typescript
|
||||
// src/routes/venue/+page.server.ts
|
||||
import { redirect } from "@sveltejs/kit";
|
||||
import type { PageServerLoad } from "./$types";
|
||||
|
||||
export const load: PageServerLoad = async (event) => {
|
||||
const session = await event.locals.auth();
|
||||
if (!session) {
|
||||
throw redirect(303, "/auth/signin");
|
||||
}
|
||||
return { session };
|
||||
};
|
||||
```
|
||||
|
||||
**Dual API client (cloud vs. local):**
|
||||
|
||||
```typescript
|
||||
// src/lib/api-client.ts
|
||||
import { browser } from "$app/environment";
|
||||
|
||||
class PvmApiClient {
|
||||
private cloudUrl: string;
|
||||
private localUrl: string | null = null;
|
||||
|
||||
constructor(cloudUrl: string) {
|
||||
this.cloudUrl = cloudUrl;
|
||||
}
|
||||
|
||||
// Set when mDNS discovers a local node
|
||||
setLocalNode(url: string) {
|
||||
this.localUrl = url;
|
||||
}
|
||||
|
||||
async fetch(path: string, token: string, options?: RequestInit) {
|
||||
// Try local first (lower latency), fall back to cloud
|
||||
if (this.localUrl) {
|
||||
try {
|
||||
const res = await fetch(`${this.localUrl}${path}`, {
|
||||
...options,
|
||||
headers: { Authorization: `Bearer ${token}`, ...options?.headers },
|
||||
signal: AbortSignal.timeout(2000), // 2s timeout for local
|
||||
});
|
||||
if (res.ok) return res;
|
||||
} catch {
|
||||
// Local node unreachable, fall through to cloud
|
||||
}
|
||||
}
|
||||
|
||||
return fetch(`${this.cloudUrl}${path}`, {
|
||||
...options,
|
||||
headers: { Authorization: `Bearer ${token}`, ...options?.headers },
|
||||
});
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 6.3 Zitadel Deployment on Hetzner PVE
|
||||
|
||||
**Docker Compose (production):**
|
||||
|
||||
```yaml
|
||||
version: "3.8"
|
||||
services:
|
||||
zitadel:
|
||||
image: ghcr.io/zitadel/zitadel:v3-latest
|
||||
command: start-from-init --masterkey "${ZITADEL_MASTERKEY}"
|
||||
environment:
|
||||
ZITADEL_DATABASE_POSTGRES_HOST: postgres
|
||||
ZITADEL_DATABASE_POSTGRES_PORT: 5432
|
||||
ZITADEL_DATABASE_POSTGRES_DATABASE: zitadel
|
||||
ZITADEL_DATABASE_POSTGRES_USER_USERNAME: zitadel
|
||||
ZITADEL_DATABASE_POSTGRES_USER_PASSWORD: "${ZITADEL_DB_PASSWORD}"
|
||||
ZITADEL_EXTERNALDOMAIN: auth.pvm.example.com
|
||||
ZITADEL_EXTERNALPORT: 443
|
||||
ZITADEL_EXTERNALSECURE: "true"
|
||||
ZITADEL_TLS_MODE: external # TLS terminated at reverse proxy
|
||||
ports:
|
||||
- "8080:8080"
|
||||
depends_on:
|
||||
postgres:
|
||||
condition: service_healthy
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
memory: 2G
|
||||
cpus: "4"
|
||||
reservations:
|
||||
memory: 512M
|
||||
cpus: "1"
|
||||
|
||||
postgres:
|
||||
image: postgres:16-alpine
|
||||
environment:
|
||||
POSTGRES_DB: zitadel
|
||||
POSTGRES_USER: zitadel
|
||||
POSTGRES_PASSWORD: "${ZITADEL_DB_PASSWORD}"
|
||||
volumes:
|
||||
- zitadel-pg-data:/var/lib/postgresql/data
|
||||
healthcheck:
|
||||
test: ["CMD-SHELL", "pg_isready -U zitadel"]
|
||||
interval: 5s
|
||||
timeout: 5s
|
||||
retries: 5
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
memory: 1G
|
||||
cpus: "2"
|
||||
|
||||
volumes:
|
||||
zitadel-pg-data:
|
||||
```
|
||||
|
||||
**Estimated resource usage on Hetzner PVE:**
|
||||
- Zitadel: 512MB-2GB RAM, 1-4 CPU cores
|
||||
- PostgreSQL (shared or dedicated): 256MB-1GB RAM
|
||||
- **Total: ~1-3GB RAM** for the auth stack
|
||||
|
||||
---
|
||||
|
||||
## 7. Security Considerations
|
||||
|
||||
### 7.1 Token Security
|
||||
|
||||
| Concern | Mitigation |
|---|---|
| Token theft | Short-lived access tokens (15min). Refresh tokens are opaque and stored server-side by Zitadel. |
| Token replay | Include `iat` (issued-at) and `jti` (JWT ID) claims. Local nodes can maintain a small replay cache. |
| Key compromise (cloud) | Zitadel supports key rotation. JWKS cache on RPi5 auto-updates. Revoke compromised keys immediately. |
| Key compromise (node) | Each node has its own key pair. Revoke a single node's key without affecting others. |
| Offline token abuse | "Bridge tokens" issued by local nodes are short-lived (30min) and carry reduced permissions. |
| JWKS cache staleness | 72h maximum offline window. Keys should have longer lifetimes than this. Include previous key versions in cache. |
|
||||
|
||||
### 7.2 Social Login Security
|
||||
|
||||
- All social OAuth flows terminate at the cloud (Zitadel handles the redirect dance)
|
||||
- Zitadel validates social provider tokens and issues PVM JWTs
|
||||
- The local node never sees social provider tokens -- only PVM JWTs
|
||||
- PKCE is used for all authorization code flows (prevents code interception)
|
||||
|
||||
### 7.3 MFA Considerations
|
||||
|
||||
- TOTP enrollment happens via Zitadel (cloud)
|
||||
- TOTP verification can work offline IF the local node has the user's TOTP secret (synced via NATS)
|
||||
- **Recommendation:** For simplicity, require MFA only for sensitive operations routed to the cloud. Venue check-in at a local node uses standard JWT validation without MFA step-up.
|
||||
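- Passkeys/FIDO2: the authenticator lives on the user's phone, but verifying the assertion still requires a relying party that holds the registered public key. With Zitadel running only in the cloud, passkey login works only when the cloud is reachable.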
|
||||
|
||||
### 7.4 AGPL License Risk Assessment
|
||||
|
||||
| Scenario | Risk |
|---|---|
| Using Zitadel as-is (our case) | Low risk. AGPL's source-disclosure duty for network use applies to modified versions; PVM runs Zitadel unmodified as a separate service. |
| Modifying Zitadel source code | Must share modifications under AGPL. Avoid this -- use Zitadel's extension points instead. |
| Linking Zitadel libraries in PVM code | The SDKs are Apache 2.0, so no issue. |
| Distributing Zitadel binary | Must provide source. Not our case -- we self-host only. |
|
||||
|
||||
### 7.5 Threat Model for Split-Brain Auth
|
||||
|
||||
**Threat:** Attacker compromises an RPi5 node and extracts the JWKS cache.
|
||||
**Impact:** Low. JWKS contains only public keys. Attacker cannot forge tokens.
|
||||
|
||||
**Threat:** Attacker compromises an RPi5 node and extracts the node's private key.
|
||||
**Impact:** Medium. Attacker can forge "bridge tokens" for that node. Mitigation: revoke the node's key via cloud, notify affected users.
|
||||
|
||||
**Threat:** Attacker presents a valid cloud JWT to a local node after the user's account is disabled in the cloud.
|
||||
**Impact:** Medium. The local node cannot check account status while offline. Mitigation: short token lifetimes (15min), and process account revocations on next NATS sync.
|
||||
|
||||
**Threat:** Replay attack with expired token during offline period.
|
||||
**Impact:** Low. JWT `exp` claim is always checked. Expired tokens are rejected regardless of network state.
|
||||
|
||||
---
|
||||
|
||||
## 8. Alternatives Considered (Detail)
|
||||
|
||||
### 8.1 Auth0
|
||||
|
||||
- **Free tier:** 25,000 MAU (B2C), 500 MAU (B2B)
|
||||
- **Pros:** Excellent documentation, many SDKs, built-in social login
|
||||
- **Cons:** Cloud-only (no self-hosting), no MFA on free tier, expensive at scale ($240/mo for pro), vendor lock-in risks for split-brain architecture. The PVM local node would depend on cached JWKS from Auth0's cloud endpoint, and any Auth0 outage would block logins, token refresh, and JWKS updates even though already-cached keys keep validating.
|
||||
- **Verdict:** Vendor dependency is unacceptable for an offline-first architecture.
|
||||
|
||||
### 8.2 Clerk
|
||||
|
||||
- **Free tier:** 10,000 MAU, 100 organizations
|
||||
- **Pros:** Great DX, community Rust SDK (`clerk-rs`), community SvelteKit SDK (`svelte-clerk`)
|
||||
- **Cons:** Cloud-only, session-based (not pure JWT), the Rust SDK is community-maintained with uncertain longevity. No self-hosting option means complete vendor dependency.
|
||||
- **Verdict:** Cloud-only with session-based auth is fundamentally incompatible with offline local nodes.
|
||||
|
||||
### 8.3 Supabase Auth (GoTrue)
|
||||
|
||||
- **Free tier:** 50,000 MAU (cloud), unlimited self-hosted
|
||||
- **Pros:** Simple JWT-based auth, supports RS256, lightweight GoTrue binary
|
||||
- **Cons:** No Rust SDK, primarily designed as part of the Supabase ecosystem. Self-hosting GoTrue independently means running it separately from the rest of Supabase, which is poorly documented. Limited social provider configuration. No admin UI when self-hosted standalone.
|
||||
- **Verdict:** Too tightly coupled to the Supabase ecosystem. Could work as a lightweight option but lacks the identity management features PVM needs.
|
||||
|
||||
### 8.4 Logto

- **Free tier:** 50,000 MAU (cloud), unlimited self-hosted
- **Pros:** Modern UI, good documentation, OIDC/OAuth 2.1 compliant, RBAC built-in, 20+ social providers
- **Cons:** No Rust SDK (would need to use generic OIDC/JWT crates), Node.js-based (heavier than Go alternatives), relatively young project. SvelteKit support via generic OIDC.
- **Verdict:** Strong contender but loses to Zitadel on Rust integration. If Zitadel didn't have its Rust crate, Logto would be the top pick.

### 8.5 SuperTokens

- **Free tier:** Unlimited self-hosted (open source features), 5,000 MAU cloud
- **Pros:** Self-hosted is fully free, good documentation, session management with anti-CSRF
- **Cons:** No Rust SDK (Node.js, Python, Go only), session-based rather than JWT-focused (would need to run a SuperTokens sidecar), requires the SuperTokens core Java service alongside the backend.
- **Verdict:** Session-based model doesn't fit split-brain offline validation. Running a Java core service adds unwanted complexity.

### 8.6 Hanko

- **Free tier:** 10,000 MAU (cloud), unlimited self-hosted (AGPL)
- **Pros:** Passkey-first (great future-proofing), lightweight Go binary, simple API, web components for frontend
- **Cons:** No Rust SDK, limited social login providers compared to Zitadel, smaller community, AGPL license (same as Zitadel). Passkey-first approach may frustrate users who prefer passwords.
- **Verdict:** Interesting for passkey-first apps but too narrow for PVM's diverse auth needs (social login, email+password, phone+password).

### 8.7 Authentik

- **Free tier:** Unlimited self-hosted (open source)
- **Pros:** Full-featured IdP, great admin UI, OIDC/OAuth2/SAML/LDAP/RADIUS support, active development
- **Cons:** No Rust SDK, Python/Django-based (2GB+ RAM minimum), heavier than Go-based alternatives. Designed primarily as a reverse-proxy auth provider for self-hosted services (Plex, Grafana, etc.), not as an embeddable auth API.
- **Verdict:** Excellent for homelab SSO but too resource-heavy and architecturally mismatched for PVM's API-first needs.

### 8.8 Building Custom Auth in Rust

**Available crates:**

- `jsonwebtoken` (v9) -- JWT creation and validation, RS256/ES256/EdDSA
- `oauth2` -- OAuth2 client flows
- `totp-rs` -- TOTP generation and validation
- `argon2` / `password-auth` -- Password hashing (Argon2id, OWASP recommended params)
- `axum-jwt-auth` -- Axum middleware for JWT with JWKS
- `openidconnect` -- Full OIDC client library

**Estimated effort:** 4-8 weeks for a full auth system with social login, password auth, MFA, session management, account recovery, email verification, and admin UI.

**Risks:**

- Auth is a security-critical system; bugs lead to breaches
- Ongoing maintenance burden (security patches, protocol updates)
- Social login requires implementing OAuth2 flows for each provider
- Account recovery, email verification, and brute force protection all need custom implementation
- No admin UI out of the box

**Verdict:** The Rust ecosystem has excellent building blocks, but assembling them into a production auth system is an estimated 4-8 week effort, plus ongoing maintenance, for functionality that Zitadel provides out of the box. The split-brain JWT validation part IS worth building custom (it's just `jsonwebtoken` + a JWKS cache), but the full identity management should be delegated to Zitadel.

---

## 9. Final Recommendation & Next Steps

### Architecture Decision

| Component | Solution |
|---|---|
| Identity Provider | Zitadel v3 (self-hosted on Hetzner PVE) |
| Cloud API auth | Zitadel Rust crate (`zitadel::axum`) for introspection OR standalone JWKS validation |
| Local node auth | Custom JWT validation using `jsonwebtoken` crate + cached JWKS |
| Frontend auth | `@auth/sveltekit` with Zitadel OIDC provider |
| JWKS sync | NATS JetStream + periodic HTTP fetch |
| Token format | RS256 JWTs (access), opaque refresh tokens |
| Database | PostgreSQL 16 (shared with or separate from PVM's main database) |

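The "Local node auth" and "JWKS sync" rows can be illustrated with a small in-memory key cache. This is a sketch under stated assumptions rather than the planned implementation: `JwksCache`, `KeySet`, `CachedKey`, and their methods are hypothetical names, each snapshot is assumed to carry the time it was fetched, and the `n`/`e` fields stand for the base64url RSA parameters of a JWK that a JWT library would later turn into a verification key. Snapshots from either source (NATS or HTTP) are offered to the same cache, and only a newer one replaces what is held.

```rust
use std::collections::HashMap;

/// One RSA public key from a JWKS document (base64url `n` and `e`).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CachedKey {
    pub kid: String,
    pub n: String,
    pub e: String,
}

/// A full key-set snapshot plus the time it was fetched (Unix seconds).
#[derive(Debug, Clone)]
pub struct KeySet {
    pub fetched_at: u64,
    pub keys: Vec<CachedKey>,
}

/// Keys indexed by `kid`; survives offline periods by never evicting on age.
#[derive(Debug, Default)]
pub struct JwksCache {
    fetched_at: u64,
    by_kid: HashMap<String, CachedKey>,
}

impl JwksCache {
    /// Install a snapshot only if it is non-empty and newer than the current one.
    /// Returns true when the cache was replaced.
    pub fn offer(&mut self, snapshot: KeySet) -> bool {
        if snapshot.keys.is_empty() || snapshot.fetched_at <= self.fetched_at {
            return false;
        }
        self.fetched_at = snapshot.fetched_at;
        self.by_kid = snapshot
            .keys
            .into_iter()
            .map(|k| (k.kid.clone(), k))
            .collect();
        true
    }

    /// Look up the key named by a token header's `kid`.
    pub fn key_for(&self, kid: &str) -> Option<&CachedKey> {
        self.by_kid.get(kid)
    }

    /// True when the cache is empty or older than `max_age_secs`.
    /// A stale cache keeps serving keys; this only signals that a fetch is due.
    pub fn needs_refresh(&self, now: u64, max_age_secs: u64) -> bool {
        self.by_kid.is_empty() || now.saturating_sub(self.fetched_at) > max_age_secs
    }
}
```

Replacing the whole map on each accepted snapshot means a key removed during rotation stops validating as soon as a newer set arrives, while an offline node keeps using the last set it saw.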
### Implementation Order

1. **Week 1:** Deploy Zitadel on Hetzner PVE (Docker Compose + PostgreSQL). Configure social providers (Google, Apple, Facebook). Set up email+password and phone auth.
2. **Week 2:** Integrate SvelteKit frontend with Zitadel using `@auth/sveltekit`. Build login/signup flows. Test PKCE authorization code flow.
3. **Week 3:** Integrate Cloud Rust API with Zitadel. Use `zitadel::axum` for token validation. Implement user context extraction from JWT claims.
4. **Week 4:** Build JWKS caching on RPi5. Implement offline JWT validation with `jsonwebtoken`. Set up NATS-based JWKS sync. Test offline scenarios.
5. **Week 5:** Implement "bridge token" issuance on RPi5 for offline token refresh. Register node public keys with cloud. Test full split-brain auth flow.
6. **Week 6:** Enable MFA (TOTP). Configure branding and custom login pages. Security review and penetration testing.

### Cost Estimate

| Item | Cost |
|---|---|
| Zitadel (self-hosted) | $0 |
| PostgreSQL (already in stack) | $0 |
| Hetzner resources (incremental) | ~5-10 EUR/month for 2GB RAM + 2 CPU LXC |
| Social login API keys | $0 (Google, Apple, Facebook all free) |
| **Total** | **~5-10 EUR/month** |

---

## Appendix: Sources

- [Zitadel GitHub](https://github.com/zitadel/zitadel)
- [Zitadel Pricing](https://zitadel.com/pricing/detail)
- [Zitadel Rust Crate (docs.rs)](https://docs.rs/zitadel/latest/zitadel/)
- [Zitadel SvelteKit Example](https://github.com/zitadel/example-auth-sveltekit)
- [Zitadel Self-Hosting Specs](https://help.zitadel.com/what-are-zitadel-minimum-self-hosted-specs)
- [Zitadel v3 Announcement (AGPL, PostgreSQL)](https://zitadel.com/blog/zitadel-v3-announcement)
- [Zitadel License FAQ](https://zitadel.com/license-faq)
- [Ory Kratos GitHub](https://github.com/ory/kratos)
- [Ory Hydra GitHub](https://github.com/ory/hydra)
- [Ory Kratos Rust SDK](https://github.com/ory/kratos-client-rust)
- [Keycloak 26.5 Release](https://www.keycloak.org/2026/01/keycloak-2650-released)
- [Keycloak Memory Sizing](https://www.keycloak.org/high-availability/concepts-memory-and-cpu-sizing)
- [Auth0 Pricing](https://auth0.com/pricing)
- [Clerk Pricing](https://clerk.com/pricing)
- [Clerk Rust SDK](https://github.com/DarrenBaldwin07/clerk-rs)
- [Logto Pricing](https://logto.io/pricing)
- [Logto GitHub](https://github.com/logto-io/logto)
- [SuperTokens Pricing](https://supertokens.com/pricing)
- [Hanko GitHub](https://github.com/teamhanko/hanko)
- [Hanko Pricing](https://www.hanko.io/pricing)
- [Authentik GitHub](https://github.com/goauthentik/authentik)
- [Authentik Pricing](https://goauthentik.io/pricing/)
- [Supabase Auth GitHub](https://github.com/supabase/auth)
- [Supabase Pricing](https://supabase.com/pricing)
- [jsonwebtoken Rust Crate](https://github.com/Keats/jsonwebtoken)
- [axum-jwt-auth Crate](https://crates.io/crates/axum-jwt-auth)
- [Edge JWT Validation Patterns](https://securityboulevard.com/2025/11/how-to-validate-jwts-efficiently-at-the-edge-with-cloudflare-workers-and-vercel/)
