Mikkel Georgsen c0ff95951e docs: add project research

Files:
- STACK.md: Technology stack recommendations (Python 3.12+, FastAPI, React 19+, Vite, Celery, PostgreSQL 18+)
- FEATURES.md: Feature landscape analysis (table stakes vs differentiators)
- ARCHITECTURE.md: Layered web-queue-worker architecture with SAT-based dependency resolution
- PITFALLS.md: Critical pitfalls and prevention strategies
- SUMMARY.md: Research synthesis with roadmap implications

Key findings:
- Stack: Modern 2026 async Python (FastAPI/Celery) + React/Three.js 3D frontend
- Architecture: Web-queue-worker pattern with sandboxed archiso builds
- Critical pitfall: Build sandboxing required from day one (CHAOS RAT AUR incident July 2025)

Recommended 9-phase roadmap: Infrastructure → Config → Dependency → Overlay → Build Queue → Frontend → Advanced SAT → 3D Viz → Optimization

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-25 02:07:11 +00:00

Domain Pitfalls: Linux Distribution Builder Platform

Domain: Web-based Linux distribution customization and ISO generation
Researched: 2026-01-25
Confidence: MEDIUM-HIGH

Critical Pitfalls

Mistakes that cause rewrites, security breaches, or major production issues.

Pitfall 1: Unsandboxed User-Generated Package Execution

What goes wrong: User-submitted overlay packages execute arbitrary code during build with full system privileges, allowing malicious actors to compromise the build server, inject malware into generated ISOs, or exfiltrate sensitive data.

Why it happens: The archiso build process and makepkg (used for AUR packages) run without sandboxing by default. Developers assume community review is sufficient, or don't realize PKGBUILD scripts execute during the build phase, not just installation.

Consequences:

In July 2025, CHAOS RAT malware was distributed through AUR packages (librewolf-fix-bin, firefox-patch-bin, zen-browser-patched-bin) that used .install scripts to execute remote code
Compromised builds can inject backdoors into ISOs downloaded by thousands of users
Build server compromise can leak user data, API keys, or allow lateral movement to other infrastructure
Legal liability for distributing malware-infected operating systems

Prevention:

NEVER run user-submitted PKGBUILDs directly on build servers
Use systemd-nspawn, nsjail, or microVMs to isolate each build in a separate sandbox
Implement static analysis on PKGBUILD files before execution (detect suspicious commands: curl, wget, eval, base64)
Run builds in ephemeral containers discarded after each build
Implement network egress filtering for build environments (block outbound connections except to approved package mirrors)
Require manual security review for any overlay containing .install scripts or custom build steps
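
The static-analysis step above could start as a simple line scanner. The following is a minimal sketch, not a complete detector: the rule names and regular expressions are illustrative assumptions, and a determined attacker can evade pattern matching, so it complements sandboxing rather than replacing it.

```python
import re

# Hypothetical rule set: names and patterns are illustrative, not exhaustive.
SUSPICIOUS_PATTERNS = {
    "pipe-to-shell": re.compile(r"\b(curl|wget)\b[^\n|]*\|\s*(ba|z)?sh\b"),
    "eval": re.compile(r"\beval\b"),
    "base64-decode": re.compile(r"\bbase64\b\s+(-d|--decode)\b"),
    "tmp-write": re.compile(r">\s*/tmp/"),
}


def scan_pkgbuild(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip full-line shell comments
        for rule, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(stripped):
                findings.append((lineno, rule))
    return findings
```

Any finding would route the overlay to manual review instead of rejecting it outright, since legitimate PKGBUILDs occasionally use these commands.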

Detection:

Monitor build processes for unexpected network connections
Alert on PKGBUILD files containing: curl/wget with piped execution, base64 encoding, eval statements, /tmp modifications
Track build duration anomalies (unexpected downloads or extra processes tend to lengthen builds)
Log all filesystem modifications during builds
Use integrity checking to detect unauthorized binary modifications

Phase to address: Phase 1 (Core Infrastructure) - Build sandboxing must be architected from the start. Retrofitting isolation onto a running build pipeline is far harder than designing it in.

Pitfall 2: Non-Deterministic Build Reproducibility

What goes wrong: The same configuration generates different ISO hashes on different builds, making it impossible to verify ISO integrity, debug user issues, or implement proper caching. Cache invalidation becomes unreliable, causing excessive rebuilds or stale builds.

Why it happens: Timestamps in build artifacts, non-deterministic file ordering, parallel build race conditions, leaked build environment variables, and external dependency fetches introduce randomness.

Consequences:

Cache invalidation strategies fail (can't detect if upstream changes require rebuild)
Users report bugs that can't be reproduced
Security auditing becomes impossible (can't verify ISO hasn't been tampered with)
Build queue backs up from unnecessary rebuilds
Wasted compute resources rebuilding identical configurations

Prevention:

Normalize all timestamps using SOURCE_DATE_EPOCH environment variable
Sort input files deterministically before processing
Use fixed locales (LC_ALL=C)
Pin compiler versions and toolchain
Disable ASLR during builds where tool output depends on memory addresses
Use GNU tar's --mtime together with --clamp-mtime to cap file timestamps in archives
Implement hermetic builds (no network access, all dependencies pre-fetched)
Configure the archiso build and its inputs for reproducibility:
- If building a custom kernel, disable CONFIG_MODULE_SIG_ALL (automatic module signing can generate a random key per build)
- Pin git commits (don't use HEAD/branch names)
- Use fixed compression levels and algorithms

Detection:

Automated testing: build same config twice, compare checksums
Monitor cache hit rate (sudden drops indicate non-determinism)
Track build output size variance for identical configs
Diff filesystem trees from duplicate builds
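
The "diff filesystem trees" check can be sketched as a per-file digest comparison. This is an illustrative helper under simple assumptions: it reads each file fully into memory and ignores symlinks, ownership, and permissions, which a production check would also want to compare.

```python
import hashlib
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each regular file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and not path.is_symlink():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_trees(a: Path, b: Path) -> list[str]:
    """Return sorted relative paths that are missing or differ between two trees."""
    da, db = tree_digests(a), tree_digests(b)
    return sorted(p for p in da.keys() | db.keys() if da.get(p) != db.get(p))
```

An empty result for two builds of the same configuration is the pass condition; any listed path points directly at the source of non-determinism.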

Phase to address: Phase 1 (Core Infrastructure) - Reproducibility must be designed into the build pipeline from the start.

Pitfall 3: Upstream Breaking Changes Without Version Pinning

What goes wrong: Omarchy or CachyOS repositories update packages with breaking changes. Suddenly all builds fail with cryptic dependency errors, incompatible kernel modules, or missing packages. No coordination exists to warn of changes.

Why it happens: Relying on rolling release repositories (Arch, CachyOS) without pinning versions. Assuming upstream maintainers will preserve compatibility. Not monitoring upstream changelogs.

Consequences:

All user builds fail simultaneously when upstream updates
Emergency firefighting to identify breaking changes
User trust erosion ("the platform is unreliable")
CachyOS experienced frequent kernel stability issues in 2025, requiring LTS fallback
Dependency mismatches between Arch and CachyOS v3 repositories in October 2025

Prevention:

Pin package repository snapshots by date (use https://archive.archlinux.org/ or equivalent)
Implement a staging environment that tests against latest upstream before promoting to production
Monitor upstream repositories for breaking changes:
- Subscribe to CachyOS announcement channels
- Track Arch Linux security advisories
- Monitor package version changes daily
Implement gradual rollout: test builds with 1% of traffic before full deployment
Provide repository version selection in UI ("stable" = 1 month old, "latest" = current)
Cache known-good package sets and allow rollback
Document which Omarchy/CachyOS features are used and monitor their changelog
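
Date-based pinning against the Arch Linux Archive can be sketched as below. The archive exposes dated repository snapshots under /repos/YYYY/MM/DD/; the function names and the 30-day reading of "stable" are assumptions for illustration. The archive covers the official Arch repositories only, so CachyOS or Omarchy repositories would need their own snapshot source.

```python
from datetime import date, timedelta

ARCHIVE_BASE = "https://archive.archlinux.org/repos"


def snapshot_server_line(snapshot: date) -> str:
    """Return a pacman mirrorlist 'Server' line pinned to one archive date."""
    return f"Server = {ARCHIVE_BASE}/{snapshot:%Y/%m/%d}/$repo/os/$arch"


def channel_snapshot(channel: str, today: date) -> date:
    """Translate a UI channel name into a snapshot date (hypothetical policy)."""
    if channel == "stable":
        return today - timedelta(days=30)
    if channel == "latest":
        return today
    raise ValueError(f"unknown channel: {channel}")
```

For example, snapshot_server_line(channel_snapshot("stable", date.today())) yields the line to write into the build sandbox's mirrorlist. A real implementation would also confirm that the chosen snapshot directory exists, since the current day's snapshot may not be published yet.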

Detection:

Automated canary builds every 6 hours against latest repos
Alert when build failure rate exceeds threshold
Track dependency resolution errors
Monitor upstream package version drift

Phase to address: Phase 2 (Build Pipeline) - After basic builds work, implement upstream isolation.

Pitfall 4: Dependency Hell Across Hundreds of Overlays

What goes wrong: User selects multiple overlays that declare conflicting package versions or file ownership. Build fails with "conflicting files" errors. Alternatively, build succeeds but generates a broken ISO where applications crash or won't start.

Why it happens: Package managers (pacman, apt) don't automatically resolve conflicts between third-party overlays. Multiple overlays might modify the same config file. No validation of overlay compatibility occurs during selection.

Consequences:

Build fails after 15 minutes of package installation
User gets cryptic error: "file /etc/foo.conf exists in packages A and B"
Generated ISO boots but applications don't work
User blames platform instead of specific overlay combination
Support burden: every overlay combination creates unique failure modes

Prevention:

Pre-validate overlay compatibility during upload:
- Extract file lists from packages
- Check for file conflicts between overlays
- Tag overlays as mutually exclusive
Implement dependency solver that detects conflicts before build starts:
- Use SAT solver or constraint solver to validate overlay combinations
- Show "conflict graph" in UI when incompatible overlays selected
Provide curated overlay collections known to work together
Generate warning when user selects overlays with overlapping file ownership
Implement priority system (if conflict, package from higher-priority overlay wins)
Test common overlay combinations in CI
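
The file-ownership check can be sketched as a pairwise set intersection. This is deliberately simpler than the SAT or constraint solver mentioned above: it only detects overlapping file paths, and it assumes the file lists have already been extracted from each overlay's packages.

```python
from itertools import combinations


def find_file_conflicts(
    overlays: dict[str, set[str]],
) -> dict[tuple[str, str], set[str]]:
    """Return {(overlay_a, overlay_b): shared_paths} for every conflicting pair.

    `overlays` maps an overlay name to the set of file paths it installs.
    """
    conflicts = {}
    for a, b in combinations(sorted(overlays), 2):
        shared = overlays[a] & overlays[b]
        if shared:
            conflicts[(a, b)] = shared
    return conflicts
```

Running this at upload time produces the data for both the "mutually exclusive" tags and the conflict graph shown in the UI.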

Detection:

Parse pacman/apt error messages for "conflicting files"
Track which overlay combinations fail most frequently
Monitor user retry patterns (same user rebuilding with fewer overlays)
Collect telemetry on successful vs failed overlay combinations

Phase to address: Phase 3 (Overlay System) - When overlay selection UI is implemented.

Pitfall 5: Cache Invalidation False Negatives

What goes wrong: Upstream package updates but cached build is still served. Users download ISOs with outdated packages containing known CVEs. Security scanners flag ISOs as vulnerable.

Why it happens: Cache invalidation logic doesn't account for transitive dependencies: a package deep in the tree updates, but the cache key only covers direct dependencies. Alternatively, on rolling release repos "latest" points to different package versions over time, so the same key can describe different contents.

Consequences:

Users install ISOs with security vulnerabilities
Platform reputation damage ("distributing outdated software")
Legal liability if vulnerable software causes data breaches
Users manually discover their ISO is outdated and distrust platform

Prevention:

Include full dependency tree hash in cache key, not just direct dependencies
Implement time-based cache expiry (max 7 days for rolling release)
Track package repository snapshot timestamps in cache metadata
Invalidate cache when ANY package in the tree updates, not just overlay packages
Provide "force rebuild with latest packages" option in UI
Display build timestamp and package versions prominently in ISO metadata
Run vulnerability scanning (grype, trivy) on generated ISOs before serving
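
A cache key that covers the full dependency tree might look like the sketch below. The data shapes (a dependency map, a version map, and a snapshot identifier string) are assumptions; in practice they would come from the dependency resolver and the pinned repository snapshot.

```python
import hashlib
import json


def transitive_closure(roots: list[str], deps: dict[str, list[str]]) -> set[str]:
    """Collect every package reachable from `roots` through `deps`."""
    seen: set[str] = set()
    stack = list(roots)
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue
        seen.add(pkg)
        stack.extend(deps.get(pkg, []))
    return seen


def cache_key(roots: list[str], deps: dict[str, list[str]],
              versions: dict[str, str], snapshot: str) -> str:
    """Hash the snapshot id plus the name and version of every package in the tree."""
    closure = transitive_closure(roots, deps)
    payload = {
        "snapshot": snapshot,
        "packages": sorted((pkg, versions[pkg]) for pkg in closure),
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
```

Because versions of indirect dependencies feed the hash, an update anywhere in the tree changes the key and forces a rebuild.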

Detection:

Compare package versions in cached ISO vs current repository
Alert when a served cached ISO is more than 14 days old (a backstop behind the 7-day expiry)
Monitor CVE databases for packages in cached ISOs
Track user reports of "outdated packages"

Phase to address: Phase 2 (Build Pipeline) - When caching is implemented.

Moderate Pitfalls

Mistakes that cause delays, poor UX, or technical debt.

Pitfall 6: 3D Visualization Performance Degradation

What goes wrong: Beautiful 3D package visualizations work perfectly on developer machines (RTX 4090) but run at 5fps on target users' mid-range laptops. Page becomes unusable. Users blame "bloated web apps."

Why it happens: Not testing on mid-range hardware. Using unoptimized Three.js scenes with too many draw calls. No progressive enhancement or fallback to 2D views. WebGL commands are issued from a single thread, so a CPU-bound render loop leaves the GPU idle.

Consequences:

Target users ("Windows refugees" with 3-year-old laptops) can't use the platform
High bounce rate from slow page load
Negative reviews: "looks pretty but unusable"
Mobile users completely locked out
Battery drain on laptops

Prevention:

Test on mid-range hardware from day one (Intel integrated graphics, GTX 1650)
Implement Level of Detail (LOD): reduce geometry complexity for distant objects
Use instancing for repeated elements (package icons)
Move rendering to Web Worker with OffscreenCanvas to unblock main thread
Consider WebGPU migration for parallel command encoding (reduces CPU bottleneck)
Provide 2D fallback UI for low-end devices
Lazy load 3D view (show 2D list first, load 3D on interaction)
Set performance budget: 60fps on Intel UHD Graphics 620
Implement automatic quality adjustment based on frame rate

Detection:

Monitor FPS in production from requestAnimationFrame timestamps
Measure GPU frame time where supported (EXT_disjoint_timer_query extensions); WebGL does not expose GPU utilization directly
A/B test: measure conversion rate for 3D vs 2D view
Collect device/GPU telemetry to understand user hardware

Phase to address: Phase 4 (3D Visualization) - During 3D UI development, enforce performance requirements.

Pitfall 7: Build Queue Starvation and Resource Contention

What goes wrong: During peak hours, build queue fills up. New builds wait 2 hours. Meanwhile, 10 builds for the same configuration are queued because different users requested identical overlays. Resources wasted on duplicate work.

Why it happens: No build deduplication. FIFO queue without prioritization. Fixed pool of build workers regardless of load. Not leveraging cache hits to avoid builds.

Consequences:

Poor user experience (long wait times)
Wasted compute resources on duplicate builds
Scaling costs spike during traffic bursts
Users retry, adding more duplicate builds to queue
Platform appears slow and unreliable

Prevention:

Implement build deduplication:
- Hash configuration (packages + overlays + options)
- If identical build in queue or recently completed, return same result
- Show "joining existing build" UI to set expectations
Add queue priority levels:
- Cache hit = instant (no build needed)
- Existing identical build = join queue position
- Small overlay = higher priority than full rebuild
- Authenticated users > anonymous
Autoscale build workers based on queue depth (e.g., Kubernetes HPA driven by a queue-length metric)
Show queue position and estimated wait time in UI
Implement progressive caching (overlay-level caching, not just full ISO)
Reserve capacity for fast/small builds to prevent queue starvation
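
Build deduplication can be sketched as a canonical configuration hash plus a registry lookup. The in-memory BuildRegistry below is a hypothetical stand-in; a real deployment would keep this mapping in shared storage (for example the database or the queue backend) and expire completed builds.

```python
import hashlib
import json


def config_hash(packages: list[str], overlays: list[str], options: dict) -> str:
    """Hash a build configuration independent of list ordering and duplicates."""
    canonical = json.dumps(
        {
            "packages": sorted(set(packages)),
            "overlays": sorted(set(overlays)),
            "options": options,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()


class BuildRegistry:
    """In-memory map from config hash to build id (illustrative only)."""

    def __init__(self) -> None:
        self._builds: dict[str, str] = {}
        self._counter = 0

    def submit(self, packages: list[str], overlays: list[str],
               options: dict) -> tuple[str, bool]:
        """Return (build_id, joined_existing)."""
        key = config_hash(packages, overlays, options)
        if key in self._builds:
            return self._builds[key], True
        self._counter += 1
        build_id = f"build-{self._counter}"
        self._builds[key] = build_id
        return build_id, False
```

The joined_existing flag is what the UI would use to show "joining existing build" instead of a fresh queue position.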

Detection:

Monitor queue depth over time
Track build deduplication hit rate
Measure p95 wait time
Alert when wait time exceeds SLA (e.g., >10 minutes)
Analyze duplicate builds (same config hash queued multiple times)

Phase to address: Phase 5 (Scaling) - After MVP proves demand exists.

Pitfall 8: Archiso Breaking Changes in Updates

What goes wrong: Platform uses archiso v85 with the older per-medium boot modes (bios.syslinux.mbr, bios.syslinux.eltorito). Archiso updates to v86+ with unified boot modes. Suddenly all builds fail with "invalid boot mode" errors.

Why it happens: Relying on latest archiso package without pinning version. Not monitoring archiso changelog. Assuming backward compatibility in tooling.

Consequences:

All builds fail when archiso updates
Emergency debugging session to identify breaking change
Must rewrite build configuration for new archiso API
User builds stuck until fix deployed

Prevention:

Pin archiso version in build environment (don't use rolling latest)
Monitor archiso changelog: https://github.com/archlinux/archiso/blob/master/CHANGELOG.rst
Test against new archiso versions in staging before upgrading production
Notable breaking changes to watch:
- v86 (Sept 2025): Boot mode consolidation (bios.syslinux replaces bios.syslinux.eltorito/mbr)
- v87 (Oct 2025): Bootstrap package config changes
- Boot parameter changes: archisodevice → archisosearchuuid
Abstract archiso-specific config behind internal API (easier to update)
Maintain compatibility layer for multiple archiso versions
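
A compatibility layer for the boot mode consolidation could be as small as a rename table applied before the profile is written. This sketch encodes only the change described in the list above (bios.syslinux replacing the eltorito and mbr variants); the function and table names are hypothetical, and the mapping should be checked against the archiso changelog for the pinned version.

```python
# Rename table taken from the v86 change described above; verify before use.
LEGACY_BOOTMODES = {
    "bios.syslinux.mbr": "bios.syslinux",
    "bios.syslinux.eltorito": "bios.syslinux",
}


def normalize_bootmodes(modes: list[str]) -> list[str]:
    """Translate legacy boot mode names and drop duplicates, preserving order."""
    result: list[str] = []
    for mode in modes:
        mapped = LEGACY_BOOTMODES.get(mode, mode)
        if mapped not in result:
            result.append(mapped)
    return result
```

Keeping the table in one place means a future archiso rename becomes a data change behind the internal API rather than an edit to every stored configuration.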

Detection:

Automated builds against latest archiso in CI
Alert on archiso package version changes in upstream repos
Parse archiso error messages for "unknown boot mode" or deprecation warnings

Phase to address: Phase 2 (Build Pipeline) - When archiso integration is implemented.

Pitfall 9: Beginner UX Assumes Linux Knowledge

What goes wrong: UI uses jargon like "initramfs", "systemd units", "GRUB config". Users see errors like "failed to install linux-firmware" with no explanation. Windows refugees feel overwhelmed and leave.

Why it happens: Developers are Linux experts, forgetting target users aren't. Passing raw build errors to UI without translation. No onboarding flow explaining concepts.

Consequences:

High bounce rate from non-technical users
Support burden: answering basic Linux questions
Negative word-of-mouth: "too complicated"
Failed promise of making Linux accessible
Common beginner mistakes from 2026 research:
- Installing incompatible packages (wrong architecture, conflicting dependencies)
- Not understanding difference between LTS and rolling release
- Customizing too much at once, breaking desktop environment

Prevention:

Translate technical errors to plain language:
- "Failed to install linux-firmware" → "A device driver package could not be installed. Try the build again; if it keeps failing, contact support."
- "Conflicting packages" → "Two of your selected packages can't be installed together. Try removing [X] or [Y]."
Implement guided mode with curated options (vs advanced mode with full control)
Add tooltips explaining Linux concepts:
- Desktop environment (with screenshots)
- LTS vs rolling release (stability vs latest features)
- Package manager basics
Provide templates: "Windows-like", "macOS-like", "Developer workstation"
Show visual previews of desktop environments, not just names
Implement "test in browser" feature (preview DE without downloading ISO)
User testing with actual Windows refugees, not Linux users
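
Error translation can be sketched as an ordered list of patterns with plain-language templates. The patterns below are assumptions modeled on typical pacman output and the wording is illustrative; both would need tuning against real build logs.

```python
import re

# Hypothetical patterns and messages; adjust to the errors actually observed.
TRANSLATIONS = [
    (
        re.compile(r"exists in both '(?P<a>[^']+)' and '(?P<b>[^']+)'"),
        "Two of your selected packages can't be installed together. "
        "Try removing {a} or {b}.",
    ),
    (
        re.compile(r"failed retrieving file"),
        "A package could not be downloaded. We will retry automatically.",
    ),
]

FALLBACK = "The build failed for an unexpected reason. Our team has been notified."


def translate_error(raw: str) -> str:
    """Return a beginner-friendly message for a raw build error line."""
    for pattern, template in TRANSLATIONS:
        match = pattern.search(raw)
        if match:
            return template.format(**match.groupdict())
    return FALLBACK
```

The raw message should still be kept and shown in advanced mode so experienced users and support staff lose no detail.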

Detection:

Track where users abandon the flow (heatmaps, analytics)
Monitor support tickets for recurring questions
A/B test simplified vs technical language
Survey users: "How confusing was this? 1-5"

Phase to address: Phase 6 (Polish & Onboarding) - After core features work, focus on UX refinement.

Pitfall 10: ISO Download Reliability Issues

What goes wrong: User customizes ISO, clicks download, and starts a 2.5 GB file transfer. The browser crashes at 80%, or a network hiccup corrupts the file. The user re-customizes and re-downloads, wasting build resources.

Why it happens: Using direct file downloads without resume support. No integrity checking before use. Not leveraging browser download manager capabilities.

Consequences:

User frustration from failed downloads
Wasted bandwidth (re-downloading)
Corrupted ISOs that fail to boot (user blames platform)
Support burden from "ISO won't boot" issues

Prevention:

Implement resumable downloads (HTTP Range requests)
Provide torrent option for large ISOs
Display SHA256 checksum prominently with instructions to verify
Use Content-Disposition header to set filename (debate-custom-2026-01-25.iso)
Consider chunked download with client-side reassembly
For a PWA approach: use the Background Fetch API for large downloads (not available in all browsers, so treat it as progressive enhancement)
- Download continues even if tab closed
- Browser shows persistent UI for download progress
- Better reliability on mobile/flaky connections
Show download progress (not just "downloading...")
Provide "test ISO in browser" option (emulator) before download

Detection:

Track download completion rate (started vs finished)
Monitor download retry patterns
Analyze user reports of "corrupted ISO"
Track checksum verification usage

Phase to address: Phase 5 (Distribution) - After ISOs are being generated.

Minor Pitfalls

Mistakes that cause annoyance but are relatively easy to fix.

Pitfall 11: Insecure Default Configurations

What goes wrong: Generated ISOs have default passwords (root/toor), SSH enabled with password auth, or autologin configured. User deploys to production and gets compromised.

Why it happens: Copying archiso baseline defaults without hardening. Assuming users will secure their systems post-install. Making convenience the default over security.

Consequences:

Generated ISOs are insecure by default
Users deploy vulnerable systems
Platform reputation damage if incidents occur
Archiso baseline includes autologin by default

Prevention:

Override insecure archiso defaults:
- Disable autologin (remove autologin.conf)
- Require password setup during ISO customization
- Disable SSH or require key-based auth
Provide security checklist in UI:
- "Will this ISO be used on the internet?" → Disable password auth
- "Will this be installed on physical hardware?" → Enable disk encryption
Show security warnings for risky configurations
Default to secure, allow opting into convenience features

Detection:

Static analysis of generated ISO configs
Alert on ISOs with default passwords or autologin
Track which security features are enabled/disabled
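
The static analysis of generated ISO configs could begin with a few targeted checks. In this minimal sketch, `files` is assumed to map paths inside the ISO root filesystem to their text content, and the two rules are illustrative. The SSH rule only catches an explicit setting; a missing directive falls back to the OpenSSH default, which a fuller audit would also consider.

```python
import re

PASSWORD_AUTH = re.compile(
    r"^\s*PasswordAuthentication\s+yes\b", re.MULTILINE | re.IGNORECASE
)


def audit_configs(files: dict[str, str]) -> list[str]:
    """Return human-readable findings for insecure settings."""
    findings = []
    for path, content in files.items():
        if path.endswith("autologin.conf") and "--autologin" in content:
            findings.append(f"{path}: autologin enabled")
        if path.endswith("sshd_config") and PASSWORD_AUTH.search(content):
            findings.append(f"{path}: SSH password authentication enabled")
    return findings
```

A non-empty result would block publication of the ISO or trigger the security warnings described above.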

Phase to address: Phase 3 (Configuration) - When users can customize security settings.

Sources:

Archiso security considerations

Pitfall 12: Inadequate Build Logging and Debugging

What goes wrong: User reports "my build failed" with no details. Build logs are 10MB of pacman output. Error message buried on line 8,432. Impossible to debug without reproduction.

Why it happens: Logging everything without structure. No log aggregation or parsing. Not extracting key errors for display.

Consequences:

Support burden (need full logs to debug)
Users can't self-service debug
Repeated builds to add debug logging
Difficult to identify systematic issues

Prevention:

Structure logs with severity levels (INFO, WARN, ERROR)
Extract and highlight fatal errors in UI
Provide "debug mode" that shows full logs
Store build logs for 30 days with unique build ID
Implement log search/filter in UI
Add build context to logs (config hash, overlay versions, timestamp)
Common errors should have KB articles linked
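
Extracting the key errors from a large raw log can be sketched as a filter that emits structured records. The prefixes below are assumptions based on the usual pacman ("error:") and makepkg ("==> ERROR:") output; the record fields mirror the build context listed above.

```python
ERROR_PREFIXES = ("error:", "==> ERROR:")  # assumed pacman / makepkg prefixes


def extract_errors(raw_log: str, build_id: str, config_hash: str) -> list[dict]:
    """Return one structured record per error line, tagged with build context."""
    records = []
    for lineno, line in enumerate(raw_log.splitlines(), start=1):
        text = line.strip()
        if text.startswith(ERROR_PREFIXES):
            records.append(
                {
                    "build_id": build_id,
                    "config_hash": config_hash,
                    "line": lineno,
                    "severity": "ERROR",
                    "message": text,
                }
            )
    return records
```

The UI can show these records first and link each one to its line in the full log, which stays available in debug mode.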

Detection:

Track support tickets requesting logs
Monitor build failure rate by error type
Analyze which errors lead to user retry vs abandonment

Phase to address: Phase 2 (Build Pipeline) - Implement with build infrastructure.

Sources:

Build automation best practices

Pitfall 13: Package Repository Mirror Failures

What goes wrong: Build relies on mirrors.cachyos.org. Mirror goes down during build. Build fails with "failed to download packages". Build queue backs up.

Why it happens: Single point of failure for package sources. Not implementing mirror fallback. Assuming mirrors have 100% uptime.

Consequences:

Builds fail during mirror outages
User sees "server error" with no explanation
Build queue fills with retries

Prevention:

Configure multiple mirrors in pacman.conf (fallback)
Cache frequently-used packages on build infrastructure
Implement retry logic with exponential backoff
Monitor mirror health and automatically disable unhealthy mirrors
Provide user feedback: "Package mirror temporarily unavailable, retrying..."
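
Mirror fallback with exponential backoff can be sketched as two nested loops. The `fetch` callable is a hypothetical hook that downloads from one mirror and raises OSError on failure; pacman itself already tries the mirrors listed in its mirrorlist in order, so a wrapper like this matters mostly for the platform's own package cache or prefetch step.

```python
import time
from typing import Callable


def fetch_with_fallback(
    mirrors: list[str],
    fetch: Callable[[str], bytes],
    attempts_per_mirror: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Try each mirror in order, retrying with exponential backoff.

    Delays between retries on one mirror are base_delay * 2**attempt.
    Raises RuntimeError once every mirror has been exhausted.
    """
    last_error: Exception | None = None
    for mirror in mirrors:
        for attempt in range(attempts_per_mirror):
            try:
                return fetch(mirror)
            except OSError as exc:
                last_error = exc
                if attempt < attempts_per_mirror - 1:
                    sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all mirrors failed") from last_error
```

Injecting `sleep` keeps the function testable without real waiting; the same hook is a natural place to emit the "retrying..." status shown to the user.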

Detection:

Monitor mirror response times and availability
Alert on increased build failures from download errors
Track which mirrors cause failures

Phase to address: Phase 2 (Build Pipeline) - When package downloading is implemented.

Sources:

CachyOS optimized repositories

Phase-Specific Warnings

Phase	Likely Pitfall	Mitigation
Phase 1: Core Infrastructure	Unsandboxed build execution (Critical #1)	Design build isolation from day one using systemd-nspawn or microVMs
Phase 1: Core Infrastructure	Non-deterministic builds (Critical #2)	Implement reproducible build practices immediately
Phase 2: Build Pipeline	Upstream breaking changes (Critical #3)	Pin repository snapshots, test against staging
Phase 2: Build Pipeline	Cache invalidation bugs (Critical #5)	Include dependency tree hash in cache key
Phase 3: Overlay System	Dependency hell (Critical #4)	Pre-validate overlay compatibility, implement conflict detection
Phase 4: 3D Visualization	Performance on mid-range hardware (Moderate #6)	Test on target hardware, implement LOD and fallbacks
Phase 5: Scaling	Build queue starvation (Moderate #7)	Implement build deduplication and autoscaling
Phase 6: Polish	Beginner UX (Moderate #9)	User test with Windows refugees, translate jargon

Validation Checklist

Before launching each phase, verify:

Phase 1 (Infrastructure):

All builds run in isolated sandboxes (no host system access)
Same configuration generates identical checksum 3 times in a row
Build logs structured and searchable
Failed builds provide actionable error messages

Phase 2 (Build Pipeline):

Package repository versions pinned/snapshotted
Mirror fallback configured and tested
Cache invalidation includes transitive dependencies
Staging environment tests against latest upstream

Phase 3 (Overlay System):

File conflict detection runs before build
Incompatible overlays show warning in UI
Dependency solver validates combinations

Phase 4 (3D Visualization):

Achieves 60fps on Intel UHD Graphics 620
2D fallback available for low-end devices
Frame rate monitoring in production

Phase 5 (Scaling):

Build deduplication prevents duplicate work
Queue autoscaling based on depth
p95 wait time under SLA

Phase 6 (Polish):

User tested with non-technical "Windows refugees"
Technical jargon translated to plain language
Download resume support implemented
Security defaults enabled

