Domain Pitfalls: Linux Distribution Builder Platform
Domain: Web-based Linux distribution customization and ISO generation
Researched: 2026-01-25
Confidence: MEDIUM-HIGH
Critical Pitfalls
Mistakes that cause rewrites, security breaches, or major production issues.
Pitfall 1: Unsandboxed User-Generated Package Execution
What goes wrong: User-submitted overlay packages execute arbitrary code during build with full system privileges, allowing malicious actors to compromise the build server, inject malware into generated ISOs, or exfiltrate sensitive data.
Why it happens: The archiso build process and makepkg (used for AUR packages) run without sandboxing by default. Developers assume community review is sufficient, or don't realize PKGBUILD scripts execute during the build phase, not just installation.
Consequences:
- In July 2025, CHAOS RAT malware was distributed through AUR packages (librewolf-fix-bin, firefox-patch-bin, zen-browser-patched-bin) that used .install scripts to execute remote code
- Compromised builds can inject backdoors into ISOs downloaded by thousands of users
- Build server compromise can leak user data, API keys, or allow lateral movement to other infrastructure
- Legal liability for distributing malware-infected operating systems
Prevention:
- NEVER run user-submitted PKGBUILDs directly on build servers
- Use systemd-nspawn, nsjail, or microVMs to isolate each build in a separate sandbox
- Implement static analysis on PKGBUILD files before execution (detect suspicious commands: curl, wget, eval, base64)
- Run builds in ephemeral containers discarded after each build
- Implement network egress filtering for build environments (block outbound connections except to approved package mirrors)
- Require manual security review for any overlay containing .install scripts or custom build steps
Detection:
- Monitor build processes for unexpected network connections
- Alert on PKGBUILD files containing: curl/wget with piped execution, base64 encoding, eval statements, /tmp modifications
- Track build duration anomalies (malicious code can add unexpected delays, for example while fetching a remote payload)
- Log all filesystem modifications during builds
- Use integrity checking to detect unauthorized binary modifications
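A minimal sketch of the static check described above, assuming PKGBUILD text is available as a string before any build runs. The rule names and regular expressions are illustrative assumptions, not a complete or evasion-proof rule set; a hit should trigger manual review rather than stand in for sandboxing.

```python
import re

# Hypothetical rule set mirroring the suspicious constructs listed above.
SUSPICIOUS_PATTERNS = {
    "piped-download": re.compile(r"\b(curl|wget)\b[^\n|]*\|\s*(ba|z)?sh\b"),
    "base64-decode": re.compile(r"\bbase64\b\s+(-d|--decode)\b"),
    "eval": re.compile(r"\beval\b"),
    "tmp-write": re.compile(r"/tmp/"),
}


def scan_pkgbuild(text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, rule_name, line) for every suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):  # skip full-line comments
            continue
        for rule, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(stripped):
                findings.append((lineno, rule, stripped))
    return findings
```

The same function can scan .install scripts. Obfuscated payloads will slip past pattern matching, which is why the sandbox remains the primary control.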
Phase to address: Phase 1 (Core Infrastructure) - Build sandboxing must be architected from the start. Retrofitting security is nearly impossible.
Pitfall 2: Non-Deterministic Build Reproducibility
What goes wrong: The same configuration generates different ISO hashes on different builds, making it impossible to verify ISO integrity, debug user issues, or implement proper caching. Cache invalidation becomes unreliable, causing excessive rebuilds or stale builds.
Why it happens: Timestamps in build artifacts, non-deterministic file ordering, parallel build race conditions, leaked build environment variables, and external dependency fetches introduce randomness.
Consequences:
- Cache invalidation strategies fail (can't detect if upstream changes require rebuild)
- Users report bugs that can't be reproduced
- Security auditing becomes impossible (can't verify ISO hasn't been tampered with)
- Build queue backs up from unnecessary rebuilds
- Wasted compute resources rebuilding identical configurations
Prevention:
- Normalize all timestamps using SOURCE_DATE_EPOCH environment variable
- Sort input files deterministically before processing
- Use fixed locales (LC_ALL=C)
- Pin compiler versions and toolchain
- Disable ASLR during builds (affects compiler output)
- Use --clamp-mtime for filesystem timestamps
- Implement hermetic builds (no network access, all dependencies pre-fetched)
- Configure archiso with reproducible options:
- Disable CONFIG_MODULE_SIG_ALL (generates random keys)
- Pin git commits (don't use HEAD/branch names)
- Use fixed compression levels and algorithms
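Several of the points above (sorted inputs, clamped timestamps, normalized ownership) can be illustrated on a single artifact. The sketch below packs a directory into an uncompressed tar using only the Python standard library. The function names are hypothetical, and this is not how archiso itself assembles images; it only shows the normalization idea.

```python
import hashlib
import io
import os
import tarfile


def deterministic_tar(root: str, source_date_epoch: int) -> bytes:
    """Pack regular files under `root` with stable ordering, clamped mtimes and neutral ownership."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        paths = []
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # stable traversal order
            paths.extend(os.path.join(dirpath, name) for name in filenames)
        for path in sorted(paths):
            info = tar.gettarinfo(path, arcname=os.path.relpath(path, root))
            # Clamp: timestamps newer than the epoch are lowered to it.
            info.mtime = min(int(info.mtime), source_date_epoch)
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            with open(path, "rb") as fh:
                tar.addfile(info, fh)
    return buf.getvalue()


def tar_digest(root: str, source_date_epoch: int) -> str:
    """SHA-256 of the normalized archive, usable for a build-twice comparison."""
    return hashlib.sha256(deterministic_tar(root, source_date_epoch)).hexdigest()
```

Calling tar_digest twice on the same tree, even after the files have been touched, should give the same value as long as contents and permissions are unchanged.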
Detection:
- Automated testing: build same config twice, compare checksums
- Monitor cache hit rate (sudden drops indicate non-determinism)
- Track build output size variance for identical configs
- Diff filesystem trees from duplicate builds
Phase to address: Phase 1 (Core Infrastructure) - Reproducibility must be designed into the build pipeline from the start.
Sources:
- Reproducible builds documentation
- Linux Kernel reproducible builds
- Three pillars of reproducible builds
Pitfall 3: Upstream Breaking Changes Without Version Pinning
What goes wrong: Omarchy or CachyOS repositories update packages with breaking changes. Suddenly all builds fail with cryptic dependency errors, incompatible kernel modules, or missing packages. No coordination exists to warn of changes.
Why it happens: Relying on rolling release repositories (Arch, CachyOS) without pinning versions. Assuming upstream maintainers will preserve compatibility. Not monitoring upstream changelogs.
Consequences:
- All user builds fail simultaneously when upstream updates
- Emergency firefighting to identify breaking changes
- User trust erosion ("the platform is unreliable")
- CachyOS experienced frequent kernel stability issues in 2025, requiring LTS fallback
- Dependency mismatches between Arch and CachyOS v3 repositories in October 2025
Prevention:
- Pin package repository snapshots by date (use https://archive.archlinux.org/ or equivalent)
- Implement a staging environment that tests against latest upstream before promoting to production
- Monitor upstream repositories for breaking changes:
- Subscribe to CachyOS announcement channels
- Track Arch Linux security advisories
- Monitor package version changes daily
- Implement gradual rollout: test builds with 1% of traffic before full deployment
- Provide repository version selection in UI ("stable" = 1 month old, "latest" = current)
- Cache known-good package sets and allow rollback
- Document which Omarchy/CachyOS features are used and monitor their changelog
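As a sketch of date-based pinning, the helper below renders a pacman mirrorlist entry for a dated Arch Linux Archive snapshot. The 30-day offset for a "stable" channel and the function names are assumptions. The archive covers the official Arch repositories only, so CachyOS or Omarchy repositories would need their own snapshot mechanism.

```python
from datetime import date, timedelta

ARCHIVE_BASE = "https://archive.archlinux.org/repos"


def stable_snapshot(today: date, age_days: int = 30) -> date:
    """Pick the snapshot date for a "stable" channel; 30 days is an assumed default."""
    return today - timedelta(days=age_days)


def pinned_server_line(snapshot: date) -> str:
    """Render a pacman mirrorlist entry that points at one dated archive snapshot."""
    return f"Server = {ARCHIVE_BASE}/{snapshot:%Y/%m/%d}/$repo/os/$arch"
```

Storing the chosen snapshot date alongside each build also gives rollback and cache metadata a single value to key on.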
Detection:
- Automated canary builds every 6 hours against latest repos
- Alert when build failure rate exceeds threshold
- Track dependency resolution errors
- Monitor upstream package version drift
Phase to address: Phase 2 (Build Pipeline) - After basic builds work, implement upstream isolation.
Pitfall 4: Dependency Hell Across Hundreds of Overlays
What goes wrong: User selects multiple overlays that declare conflicting package versions or file ownership. Build fails with "conflicting files" errors. Alternatively, build succeeds but generates a broken ISO where applications crash or won't start.
Why it happens: Package managers (pacman, apt) don't automatically resolve conflicts between third-party overlays. Multiple overlays might modify the same config file. No validation of overlay compatibility occurs during selection.
Consequences:
- Build fails after 15 minutes of package installation
- User gets cryptic error: "file /etc/foo.conf exists in packages A and B"
- Generated ISO boots but applications don't work
- User blames platform instead of specific overlay combination
- Support burden: every overlay combination creates unique failure modes
Prevention:
- Pre-validate overlay compatibility during upload:
- Extract file lists from packages
- Check for file conflicts between overlays
- Tag overlays as mutually exclusive
- Implement dependency solver that detects conflicts before build starts:
- Use SAT solver or constraint solver to validate overlay combinations
- Show "conflict graph" in UI when incompatible overlays selected
- Provide curated overlay collections known to work together
- Generate warning when user selects overlays with overlapping file ownership
- Implement priority system (if conflict, package from higher-priority overlay wins)
- Test common overlay combinations in CI
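The file-ownership check above reduces to set intersection once file lists have been extracted from each overlay's packages. A minimal sketch, assuming a hypothetical mapping from overlay name to the set of paths it installs:

```python
from itertools import combinations


def find_file_conflicts(overlay_files: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Map each conflicting overlay pair to the paths both of them claim."""
    conflicts = {}
    for a, b in combinations(sorted(overlay_files), 2):
        shared = overlay_files[a] & overlay_files[b]
        if shared:
            conflicts[(a, b)] = shared
    return conflicts
```

The resulting pairs can feed the conflict graph in the UI or be turned into mutual-exclusion constraints for a solver. This catches file overlap only; version conflicts need the dependency solver.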
Detection:
- Parse pacman/apt error messages for "conflicting files"
- Track which overlay combinations fail most frequently
- Monitor user retry patterns (same user rebuilding with fewer overlays)
- Collect telemetry on successful vs failed overlay combinations
Phase to address: Phase 3 (Overlay System) - When overlay selection UI is implemented.
Pitfall 5: Cache Invalidation False Negatives
What goes wrong: Upstream package updates but cached build is still served. Users download ISOs with outdated packages containing known CVEs. Security scanners flag ISOs as vulnerable.
Why it happens: Cache invalidation logic doesn't account for transitive dependencies. Package A updates, but cache key only checks direct dependencies. Alternatively, rolling release repos mean "latest" points to different package versions over time.
Consequences:
- Users install ISOs with security vulnerabilities
- Platform reputation damage ("distributing outdated software")
- Legal liability if vulnerable software causes data breaches
- Users manually discover their ISO is outdated and distrust platform
Prevention:
- Include full dependency tree hash in cache key, not just direct dependencies
- Implement time-based cache expiry (max 7 days for rolling release)
- Track package repository snapshot timestamps in cache metadata
- Invalidate cache when ANY package in the tree updates, not just overlay packages
- Provide "force rebuild with latest packages" option in UI
- Display build timestamp and package versions prominently in ISO metadata
- Run vulnerability scanning (grype, trivy) on generated ISOs before serving
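To make the "full dependency tree hash" concrete, the sketch below walks the transitive closure of the selected packages and hashes every name and version together with the repository snapshot. The input shapes (a dependency map and a version map) are assumptions; in practice they would come from the package database.

```python
import hashlib


def dependency_closure(roots: list[str], depends: dict[str, list[str]]) -> set[str]:
    """Collect the selected packages plus everything they pull in transitively."""
    seen = set()
    stack = list(roots)
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue
        seen.add(pkg)
        stack.extend(depends.get(pkg, ()))
    return seen


def cache_key(roots: list[str], depends: dict[str, list[str]],
              versions: dict[str, str], snapshot: str) -> str:
    """Hash the snapshot id and every package version in the closure."""
    closure = dependency_closure(roots, depends)
    lines = [f"snapshot={snapshot}"]
    lines += [f"{pkg}={versions[pkg]}" for pkg in sorted(closure)]
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()
```

A version bump anywhere in the closure changes the key, while updates to unrelated packages leave it alone, so the cache is neither stale nor needlessly invalidated.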
Detection:
- Compare package versions in cached ISO vs current repository
- Alert when a cached ISO older than 14 days is served (this means time-based expiry has failed)
- Monitor CVE databases for packages in cached ISOs
- Track user reports of "outdated packages"
Phase to address: Phase 2 (Build Pipeline) - When caching is implemented.
Moderate Pitfalls
Mistakes that cause delays, poor UX, or technical debt.
Pitfall 6: 3D Visualization Performance Degradation
What goes wrong: Beautiful 3D package visualizations work perfectly on developer machines (RTX 4090) but run at 5fps on target users' mid-range laptops. Page becomes unusable. Users blame "bloated web apps."
Why it happens: Not testing on mid-range hardware. Using unoptimized Three.js scenes with too many draw calls. No progressive enhancement or fallback to 2D views. WebGL single-threaded bottleneck starves GPU.
Consequences:
- Target users ("Windows refugees" with 3-year-old laptops) can't use the platform
- High bounce rate from slow page load
- Negative reviews: "looks pretty but unusable"
- Mobile users completely locked out
- Battery drain on laptops
Prevention:
- Test on mid-range hardware from day one (Intel integrated graphics, GTX 1650)
- Implement Level of Detail (LOD): reduce geometry complexity for distant objects
- Use instancing for repeated elements (package icons)
- Move rendering to Web Worker with OffscreenCanvas to unblock main thread
- Consider WebGPU migration for parallel command encoding (reduces CPU bottleneck)
- Provide 2D fallback UI for low-end devices
- Lazy load 3D view (show 2D list first, load 3D on interaction)
- Set performance budget: 60fps on Intel UHD Graphics 620
- Implement automatic quality adjustment based on frame rate
Detection:
- Monitor FPS via Performance API in production
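- Track GPU frame time where the browser exposes it (for example, timer-query extensions such as EXT_disjoint_timer_query_webgl2); WebGL does not report GPU utilization directly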
- A/B test: measure conversion rate for 3D vs 2D view
- Collect device/GPU telemetry to understand user hardware
Phase to address: Phase 4 (3D Visualization) - During 3D UI development, enforce performance requirements.
Pitfall 7: Build Queue Starvation and Resource Contention
What goes wrong: During peak hours, build queue fills up. New builds wait 2 hours. Meanwhile, 10 builds for the same configuration are queued because different users requested identical overlays. Resources wasted on duplicate work.
Why it happens: No build deduplication. FIFO queue without prioritization. Fixed pool of build workers regardless of load. Not leveraging cache hits to avoid builds.
Consequences:
- Poor user experience (long wait times)
- Wasted compute resources on duplicate builds
- Scaling costs spike during traffic bursts
- Users retry, adding more duplicate builds to queue
- Platform appears slow and unreliable
Prevention:
- Implement build deduplication:
- Hash configuration (packages + overlays + options)
- If identical build in queue or recently completed, return same result
- Show "joining existing build" UI to set expectations
- Add queue priority levels:
- Cache hit = instant (no build needed)
- Existing identical build = join queue position
- Small overlay = higher priority than full rebuild
- Authenticated users > anonymous
- Autoscale build workers based on queue depth (Kubernetes HPA)
- Show queue position and estimated wait time in UI
- Implement progressive caching (overlay-level caching, not just full ISO)
- Reserve capacity for fast/small builds to prevent queue starvation
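Build deduplication hinges on a canonical configuration hash. The sketch below uses an in-memory registry purely for illustration; a real deployment would presumably keep this mapping in a shared store next to the Celery queue. It also assumes package and overlay order does not affect the build, which would not hold if an overlay priority system is in use.

```python
import hashlib
import json
import uuid


def config_hash(config: dict) -> str:
    """Hash a build configuration independent of key order and list order."""
    normalized = dict(config)
    for key in ("packages", "overlays"):
        normalized[key] = sorted(config.get(key, []))
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


class BuildRegistry:
    """Hypothetical registry of queued or running builds, keyed by config hash."""

    def __init__(self) -> None:
        self._active: dict[str, str] = {}

    def submit(self, config: dict) -> tuple[str, bool]:
        """Return (build_id, joined); joined is True when an identical build already exists."""
        key = config_hash(config)
        if key in self._active:
            return self._active[key], True
        build_id = uuid.uuid4().hex
        self._active[key] = build_id
        return build_id, False
```

The joined flag is what the UI would use to show the "joining existing build" message.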
Detection:
- Monitor queue depth over time
- Track build deduplication hit rate
- Measure p95 wait time
- Alert when wait time exceeds SLA (e.g., >10 minutes)
- Analyze duplicate builds (same config hash queued multiple times)
Phase to address: Phase 5 (Scaling) - After MVP proves demand exists.
Pitfall 8: Archiso Breaking Changes in Updates
What goes wrong: Platform uses archiso v85, which has certain boot mode configurations. Archiso updates to v86+ with unified boot modes. Suddenly all builds fail with "invalid boot mode" errors.
Why it happens: Relying on latest archiso package without pinning version. Not monitoring archiso changelog. Assuming backward compatibility in tooling.
Consequences:
- All builds fail when archiso updates
- Emergency debugging session to identify breaking change
- Must rewrite build configuration for new archiso API
- User builds stuck until fix deployed
Prevention:
- Pin archiso version in build environment (don't use rolling latest)
- Monitor archiso changelog: https://github.com/archlinux/archiso/blob/master/CHANGELOG.rst
- Test against new archiso versions in staging before upgrading production
- Notable breaking changes to watch:
- v86 (Sept 2025): Boot mode consolidation (bios.syslinux replaces bios.syslinux.eltorito/mbr)
- v87 (Oct 2025): Bootstrap package config changes
- Boot parameter changes: archisodevice → archisosearchuuid
- Abstract archiso-specific config behind internal API (easier to update)
- Maintain compatibility layer for multiple archiso versions
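One small piece of such a compatibility layer could be a translation table for renamed boot modes. The sketch below encodes only the consolidation mentioned above (bios.syslinux replacing the .mbr and .eltorito variants); the table and function name are assumptions and should be checked against the archiso changelog for the pinned version.

```python
# Legacy names from the boot mode consolidation noted above; extend per changelog.
LEGACY_BOOTMODES = {
    "bios.syslinux.mbr": "bios.syslinux",
    "bios.syslinux.eltorito": "bios.syslinux",
}


def normalize_bootmodes(bootmodes: list[str]) -> list[str]:
    """Translate legacy boot mode names and drop duplicates, preserving order."""
    result = []
    for mode in bootmodes:
        mode = LEGACY_BOOTMODES.get(mode, mode)
        if mode not in result:
            result.append(mode)
    return result
```

Keeping this mapping behind the internal API means stored user configurations do not need to be migrated when the archiso pin moves.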
Detection:
- Automated builds against latest archiso in CI
- Alert on archiso package version changes in upstream repos
- Parse archiso error messages for "unknown boot mode" or deprecation warnings
Phase to address: Phase 2 (Build Pipeline) - When archiso integration is implemented.
Pitfall 9: Beginner UX Assumes Linux Knowledge
What goes wrong: UI uses jargon like "initramfs", "systemd units", "GRUB config". Users see errors like "failed to install linux-firmware" with no explanation. Windows refugees feel overwhelmed and leave.
Why it happens: Developers are Linux experts, forgetting target users aren't. Passing raw build errors to UI without translation. No onboarding flow explaining concepts.
Consequences:
- High bounce rate from non-technical users
- Support burden: answering basic Linux questions
- Negative word-of-mouth: "too complicated"
- Failed promise of making Linux accessible
- Common beginner mistakes from 2026 research:
- Installing incompatible packages (wrong architecture, conflicting dependencies)
- Not understanding difference between LTS and rolling release
- Customizing too much at once, breaking desktop environment
Prevention:
- Translate technical errors to plain language:
- "Failed to install linux-firmware" → "Your ISO needs device drivers. This is normal and will be included."
- "Conflicting packages" → "Two of your selected packages can't be installed together. Try removing [X] or [Y]."
- Implement guided mode with curated options (vs advanced mode with full control)
- Add tooltips explaining Linux concepts:
- Desktop environment (with screenshots)
- LTS vs rolling release (stability vs latest features)
- Package manager basics
- Provide templates: "Windows-like", "macOS-like", "Developer workstation"
- Show visual previews of desktop environments, not just names
- Implement "test in browser" feature (preview DE without downloading ISO)
- User testing with actual Windows refugees, not Linux users
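Error translation can start as an ordered list of patterns mapped to plain-language messages. The patterns and wording below are illustrative assumptions; real rules would be derived from the errors that actually show up in build logs.

```python
import re

# (pattern, friendly message) pairs, checked in order; all entries are examples.
TRANSLATIONS = [
    (re.compile(r"conflicting files|are in conflict", re.IGNORECASE),
     "Two of your selected packages can't be installed together. Try removing one of them."),
    (re.compile(r"failed retrieving file|failed to download", re.IGNORECASE),
     "A package server is temporarily unavailable. We will retry automatically."),
]

FALLBACK = "Your build hit an unexpected problem. The full log is available in debug mode."


def explain(raw_error: str) -> str:
    """Return a beginner-friendly message for a raw build error line."""
    for pattern, message in TRANSLATIONS:
        if pattern.search(raw_error):
            return message
    return FALLBACK
```

Logging every error that falls through to the fallback gives a ready-made list of translations still to write.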
Detection:
- Track where users abandon the flow (heatmaps, analytics)
- Monitor support tickets for recurring questions
- A/B test simplified vs technical language
- Survey users: "How confusing was this? 1-5"
Phase to address: Phase 6 (Polish & Onboarding) - After core features work, focus on UX refinement.
Pitfall 10: ISO Download Reliability Issues
What goes wrong: User customizes an ISO, clicks download, and starts a 2.5 GB file transfer. The browser crashes at 80%, or a network hiccup corrupts the file. The user re-customizes and re-downloads, wasting build resources.
Why it happens: Using direct file downloads without resume support. No integrity checking before use. Not leveraging browser download manager capabilities.
Consequences:
- User frustration from failed downloads
- Wasted bandwidth (re-downloading)
- Corrupted ISOs that fail to boot (user blames platform)
- Support burden from "ISO won't boot" issues
Prevention:
- Implement resumable downloads (HTTP Range requests)
- Provide torrent option for large ISOs
- Display SHA256 checksum prominently with instructions to verify
- Use Content-Disposition header to set filename (debate-custom-2026-01-25.iso)
- Consider chunked download with client-side reassembly
- For PWA approach: Use Background Fetch API for large downloads
- Download continues even if tab closed
- Browser shows persistent UI for download progress
- Better reliability on mobile/flaky connections
- Show download progress (not just "downloading...")
- Provide "test ISO in browser" option (emulator) before download
Detection:
- Track download completion rate (started vs finished)
- Monitor download retry patterns
- Analyze user reports of "corrupted ISO"
- Track checksum verification usage
Phase to address: Phase 5 (Distribution) - After ISOs are being generated.
Minor Pitfalls
Mistakes that cause annoyance but are relatively easy to fix.
Pitfall 11: Insecure Default Configurations
What goes wrong: Generated ISOs have default passwords (root/toor), SSH enabled with password auth, or autologin configured. User deploys to production and gets compromised.
Why it happens: Copying archiso baseline defaults without hardening. Assuming users will secure their systems post-install. Making convenience the default over security.
Consequences:
- Generated ISOs are insecure by default
- Users deploy vulnerable systems
- Platform reputation damage if incidents occur
- Archiso baseline includes autologin by default
Prevention:
- Override insecure archiso defaults:
- Disable autologin (remove autologin.conf)
- Require password setup during ISO customization
- Disable SSH or require key-based auth
- Provide security checklist in UI:
- "Will this ISO be used on the internet?" → Disable password auth
- "Will this be installed on physical hardware?" → Enable disk encryption
- Show security warnings for risky configurations
- Default to secure, allow opting into convenience features
Detection:
- Static analysis of generated ISO configs
- Alert on ISOs with default passwords or autologin
- Track which security features are enabled/disabled
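A static check over the ISO's root filesystem tree might look like the sketch below. The paths are assumptions: it flags any file named autologin.conf and an explicit PasswordAuthentication yes in etc/ssh/sshd_config. It does not cover sshd drop-in directories, nor the case where the option is unset and sshd falls back to its built-in default.

```python
from pathlib import Path


def audit_airootfs(root: str) -> list[str]:
    """Return human-readable findings for insecure defaults under `root`."""
    findings = []
    base = Path(root)
    for path in sorted(base.rglob("autologin.conf")):
        findings.append(f"autologin drop-in present: {path.relative_to(base).as_posix()}")
    sshd = base / "etc" / "ssh" / "sshd_config"
    if sshd.is_file():
        for line in sshd.read_text().splitlines():
            parts = line.split()
            if (len(parts) >= 2
                    and parts[0].lower() == "passwordauthentication"
                    and parts[1].lower() == "yes"):
                findings.append("sshd allows password authentication")
                break
    return findings
```

Running this before the image is assembled lets the platform block or warn on a build instead of discovering the problem in a shipped ISO.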
Phase to address: Phase 3 (Configuration) - When users can customize security settings.
Pitfall 12: Inadequate Build Logging and Debugging
What goes wrong: User reports "my build failed" with no details. Build logs are 10MB of pacman output. Error message buried on line 8,432. Impossible to debug without reproduction.
Why it happens: Logging everything without structure. No log aggregation or parsing. Not extracting key errors for display.
Consequences:
- Support burden (need full logs to debug)
- Users can't self-service debug
- Repeated builds to add debug logging
- Difficult to identify systematic issues
Prevention:
- Structure logs with severity levels (INFO, WARN, ERROR)
- Extract and highlight fatal errors in UI
- Provide "debug mode" that shows full logs
- Store build logs for 30 days with unique build ID
- Implement log search/filter in UI
- Add build context to logs (config hash, overlay versions, timestamp)
- Common errors should have KB articles linked
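Extracting the fatal lines from a large log can begin with a simple prefix match. The sketch below looks for lines starting with "error:" (as pacman prints) or "==> ERROR:" (as makepkg prints) and keeps a few preceding lines for context. The output structure is an assumption.

```python
import re

ERROR_LINE = re.compile(r"^(error|==> ERROR):", re.IGNORECASE)


def extract_errors(log_text: str, context: int = 2) -> list[dict]:
    """Return each error line with its 1-based line number and preceding context lines."""
    lines = log_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if ERROR_LINE.match(line.strip()):
            start = max(i - context, 0)
            hits.append({
                "line": i + 1,
                "message": line.strip(),
                "context": lines[start:i],
            })
    return hits
```

The first hit is usually the one to surface in the UI, with the line number linking into the full log in debug mode.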
Detection:
- Track support tickets requesting logs
- Monitor build failure rate by error type
- Analyze which errors lead to user retry vs abandonment
Phase to address: Phase 2 (Build Pipeline) - Implement with build infrastructure.
Pitfall 13: Package Repository Mirror Failures
What goes wrong: Build relies on mirrors.cachyos.org. Mirror goes down during build. Build fails with "failed to download packages". Build queue backs up.
Why it happens: Single point of failure for package sources. Not implementing mirror fallback. Assuming mirrors have 100% uptime.
Consequences:
- Builds fail during mirror outages
- User sees "server error" with no explanation
- Build queue fills with retries
Prevention:
- Configure multiple mirrors in pacman.conf (fallback)
- Cache frequently-used packages on build infrastructure
- Implement retry logic with exponential backoff
- Monitor mirror health and automatically disable unhealthy mirrors
- Provide user feedback: "Package mirror temporarily unavailable, retrying..."
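Retry with exponential backoff and mirror fallback can be combined in one loop. In the sketch below, fetch is a hypothetical callable that downloads from one mirror and raises OSError on failure; the sleep function is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Sequence


def fetch_with_fallback(
    mirrors: Sequence[str],
    fetch: Callable[[str], bytes],
    attempts_per_mirror: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Try each mirror in order, backing off 1x, 2x, 4x... between attempts on the same mirror."""
    last_error = None
    for mirror in mirrors:
        for attempt in range(attempts_per_mirror):
            try:
                return fetch(mirror)
            except OSError as exc:
                last_error = exc
                # No sleep after the final attempt; move straight to the next mirror.
                if attempt < attempts_per_mirror - 1:
                    sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all mirrors failed") from last_error
```

Adding random jitter to the delay is a common refinement when many workers retry against the same mirror at once.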
Detection:
- Monitor mirror response times and availability
- Alert on increased build failures from download errors
- Track which mirrors cause failures
Phase to address: Phase 2 (Build Pipeline) - When package downloading is implemented.
Phase-Specific Warnings
| Phase | Likely Pitfall | Mitigation |
|---|---|---|
| Phase 1: Core Infrastructure | Unsandboxed build execution (Critical #1) | Design build isolation from day one using systemd-nspawn or microVMs |
| Phase 1: Core Infrastructure | Non-deterministic builds (Critical #2) | Implement reproducible build practices immediately |
| Phase 2: Build Pipeline | Upstream breaking changes (Critical #3) | Pin repository snapshots, test against staging |
| Phase 2: Build Pipeline | Cache invalidation bugs (Critical #5) | Include dependency tree hash in cache key |
| Phase 3: Overlay System | Dependency hell (Critical #4) | Pre-validate overlay compatibility, implement conflict detection |
| Phase 4: 3D Visualization | Performance on mid-range hardware (Moderate #6) | Test on target hardware, implement LOD and fallbacks |
| Phase 5: Scaling | Build queue starvation (Moderate #7) | Implement build deduplication and autoscaling |
| Phase 6: Polish | Beginner UX (Moderate #9) | User test with Windows refugees, translate jargon |
Validation Checklist
Before launching each phase, verify:
Phase 1 (Infrastructure):
- All builds run in isolated sandboxes (no host system access)
- Same configuration generates identical checksum 3 times in a row
- Build logs structured and searchable
- Failed builds provide actionable error messages
Phase 2 (Build Pipeline):
- Package repository versions pinned/snapshotted
- Mirror fallback configured and tested
- Cache invalidation includes transitive dependencies
- Staging environment tests against latest upstream
Phase 3 (Overlay System):
- File conflict detection runs before build
- Incompatible overlays show warning in UI
- Dependency solver validates combinations
Phase 4 (3D Visualization):
- Achieves 60fps on Intel UHD Graphics 620
- 2D fallback available for low-end devices
- Frame rate monitoring in production
Phase 5 (Scaling):
- Build deduplication prevents duplicate work
- Queue autoscaling based on depth
- p95 wait time under SLA
Phase 6 (Polish):
- User tested with non-technical "Windows refugees"
- Technical jargon translated to plain language
- Download resume support implemented
- Security defaults enabled
Sources
Security & Malware:
- CHAOS RAT Found in Arch Linux AUR Packages
- AUR Malware Packages Exploit Critical Security Flaws Exposed
- Arch Linux Removes Malicious AUR Packages
- Sandboxing untrusted code in 2026
Reproducible Builds:
- Reproducible builds - deterministic build systems
- Linux Kernel reproducible builds
- Three Pillars of Reproducible Builds
Dependency & Package Management:
- Package Conflict Resolution
- Dependency hell - Wikipedia
- CachyOS FAQ & Troubleshooting
- CachyOS dependency errors
Performance & Scaling:
- WebGL vs WebGPU performance in Three.js
- Building Efficient Three.js Scenes
- Faster WebGL with OffscreenCanvas
- Linux package build server scaling
User Experience:
- 10 Linux Mistakes Every Beginner Makes
- Navigating the Switch: Choosing Linux Distro in 2026
- 13 UX Design Mistakes to Avoid in 2026