docs: define milestone v1.7 requirements

2026-04-04 11:24:04 +00:00 · 2026-04-04 11:24:04 +00:00 · cc569b4cd6
commit cc569b4cd6
parent e4a103cd9b
1 changed file with 108 additions and 74 deletions
--- a/.planning/REQUIREMENTS.md
+++ b/.planning/REQUIREMENTS.md
@@ -1,106 +1,140 @@
-# Requirements: Nexus v1.6 — Voice Pipeline + Minimal Message Bridge
+# Requirements: Nexus v1.7 — Content Generation

 **Defined:** 2026-04-04
-**Core Value:** A fresh onboard asks for ONE thing (root directory), auto-creates PM + Engineer agents, and drops you in the dashboard.
+**Core Value:** A fresh onboard asks for ONE thing (root directory), auto-creates PM + Engineer agents, and drops you in the dashboard — no company names, missions, or corporate language anywhere.

-## v1.6 Requirements
+## v1.7 Requirements

-### Voice Pipeline
+Requirements for the Content Generation milestone. Each requirement maps to one or more roadmap phases.

-- [x] **VPIPE-01**: User's voice input is transcribed via local Whisper STT with automatic language detection
-- [x] **VPIPE-02**: Agent text responses are synthesized to speech via local Piper TTS in under 3 seconds
-- [x] **VPIPE-03**: Voice pipeline accepts audio from any transport (web chat, Telegram) via a shared VoicePipelineService
-- [x] **VPIPE-04**: Audio from any source is transcoded to WAV 16kHz mono via ffmpeg before Whisper processing
-- [x] **VPIPE-05**: Voice mode flag on messages triggers voice-optimized response formatting (no markdown, natural prose)
-- [x] **VPIPE-06**: Every voice interaction produces dual output: spoken prose response + full text with code blocks
-- [x] **VPIPE-07**: TTS plays first sentence while subsequent sentences are still synthesizing (sentence-buffered streaming)
-- [x] **VPIPE-08**: User can synthesize a single text response into multiple language audio outputs (multi-language TTS)
+### Infrastructure

-### Web Chat Voice
+- [ ] **INFRA-01**: System processes content generation jobs asynchronously with queued → running → done/failed lifecycle
+- [ ] **INFRA-02**: System pushes job progress updates via SSE to connected clients
+- [ ] **INFRA-03**: Generated content is stored in namespaced storage, with no size restrictions that would block video or image output
+- [ ] **INFRA-04**: All generated content is tracked in the database and linked to its source conversation

-- [x] **WCHAT-01**: Mic button in chat input starts/stops voice recording with visual state (idle/recording/processing)
-- [x] **WCHAT-02**: Recording auto-stops on silence detection via VAD (voice activity detection)
-- [x] **WCHAT-03**: Real-time waveform/amplitude visualization displays while recording
-- [x] **WCHAT-04**: Voice response audio plays inline in chat message with audio player controls
-- [x] **WCHAT-05**: User can toggle voice mode: text only / voice input only / full voice (input + output)
-- [x] **WCHAT-06**: Auto-play of voice responses is configurable (on/off in settings)
+### Diagram Generation

-### Telegram Bridge
+- [ ] **DIAG-01**: User can generate diagrams from natural language description
+- [ ] **DIAG-02**: System renders Mermaid syntax to SVG and PNG formats
+- [ ] **DIAG-03**: User can view and edit the Mermaid source for refinement
+- [ ] **DIAG-04**: System supports architecture, flowchart, ERD, sequence, and mind map diagram types
+- [ ] **DIAG-05**: Mermaid rendering enforces strict security level to prevent XSS

-- [x] **TGRAM-01**: Single Telegram bot relays text messages bidirectionally between user and agents
-- [x] **TGRAM-02**: Agent replies in Telegram are prefixed with agent identity (e.g. `[PM]`, `[Engineer]`)
-- [x] **TGRAM-03**: Telegram voice messages are transcribed (OGG → Whisper) and forwarded to agent as text
-- [x] **TGRAM-04**: Agent responses can be sent back as Telegram voice notes (TTS → OGG)
-- [x] **TGRAM-05**: Telegram bridge uses long polling (no public HTTPS required)
-- [x] **TGRAM-06**: Telegram bridge is under 500 lines of code
+### Theme & Palette

-### Onboarding
+- [ ] **THEME-01**: User can pick a seed color and receive a complete palette (background, surface, overlay, text, accents)
+- [ ] **THEME-02**: System generates palette in OKLCH color space with Catppuccin-style naming
+- [ ] **THEME-03**: System validates WCAG AA contrast for all foreground/background pairs
+- [ ] **THEME-04**: User can preview Nexus UI with the generated palette live
+- [ ] **THEME-05**: User can export palette as CSS custom properties, Tailwind config, VS Code theme, or JSON
+- [ ] **THEME-06**: System generates dark and light variants from a single seed color
+- [ ] **THEME-07**: User can apply generated theme to their Nexus instance in one click

-- [x] **ONBRD-01**: Onboarding hardware probe detects Whisper STT and Piper TTS capability
-- [x] **ONBRD-02**: Onboarding presents voice enable/skip step based on hardware detection results
-- [x] **ONBRD-03**: Guided BotFather setup flow for Telegram bot token during onboarding
+### Document Generation
+
+- [ ] **DOC-01**: User can generate formatted PDF reports from conversation content
+- [ ] **DOC-02**: User can generate invoices and contracts from templates
+- [ ] **DOC-03**: User can generate one-pagers and API documentation
+
+### Icon Generation
+
+- [ ] **ICON-01**: User can generate SVG icons from a text description
+- [ ] **ICON-02**: System produces icon sets with consistent visual style
+- [ ] **ICON-03**: User can export icons in multiple sizes and formats (SVG, PNG)
+
+### Wallpapers & Visual Assets
+
+- [ ] **WALL-01**: User can generate desktop and mobile wallpapers from a description
+- [ ] **WALL-02**: User can generate social media banners with correct dimensions per platform
+- [ ] **WALL-03**: User can generate Open Graph and social preview images
+- [ ] **WALL-04**: User can generate app icons and favicons in multiple sizes
+
+### Presentations & Video
+
+- [ ] **PRES-01**: User can generate pitch deck presentations from a conversation
+- [ ] **PRES-02**: System renders presentations via Remotion to an interactive web player or an MP4 file
+- [ ] **PRES-03**: User can generate demo and explainer videos from conversation content
+- [ ] **PRES-04**: System shows render progress via SSE during video generation
+
+### Social Media Content
+
+- [ ] **SOCIAL-01**: User can generate platform-ready posts respecting character limits (Twitter, LinkedIn)
+- [ ] **SOCIAL-02**: User can generate Instagram carousels and thread sequences
+- [ ] **SOCIAL-03**: System suggests relevant hashtags for generated content
+
+### Branding Media Kit
+
+- [ ] **BRAND-01**: User can generate a full brand identity from a single conversation
+- [ ] **BRAND-02**: System produces a logo mark (SVG) and an avatar in multiple sizes
+- [ ] **BRAND-03**: System produces social media profile images and banners per platform
+- [ ] **BRAND-04**: System produces email signature and letterhead templates
+- [ ] **BRAND-05**: System produces a brand guidelines document (PDF)
+- [ ] **BRAND-06**: User can download all brand assets as a zip package
+
+### Format Conversion
+
+- [ ] **CONV-01**: User can convert between image formats (PNG, JPG, SVG, WebP, GIF) via sharp (SVG as input only, since sharp does not write SVG)
+- [ ] **CONV-02**: User can convert between audio/video formats via ffmpeg
+- [ ] **CONV-03**: User can convert between document formats (Markdown, HTML, PDF, DOCX) via Pandoc/LibreOffice
+- [ ] **CONV-04**: User can convert between data formats (CSV, JSON, XLSX) via direct tooling
+- [ ] **CONV-05**: User can convert between any format pair via AI-bridged conversion for semantically complex transforms
+- [ ] **CONV-06**: System provides a conversion UI with source/target format selection and drag-drop input
+- [ ] **CONV-07**: User can deep-link to specific conversion flows via URL (e.g. `/convert/png/svg`)
+- [ ] **CONV-08**: System detects available direct converters at startup and degrades gracefully — unavailable direct paths fall through to AI-bridged conversion rather than showing as blocked
+- [ ] **CONV-09**: System validates uploaded file MIME type via magic-byte detection before processing
+
+### Whisper Web Chat
+
+- [ ] **VOICE-01**: User can click a mic button in web chat to record and auto-transcribe via Whisper
+- [ ] **VOICE-02**: User can toggle between text-only, voice-input, and full-voice modes
+- [ ] **VOICE-03**: Voice input works offline with local Whisper model
+
+### Content as Skills
+
+- [ ] **SKILL-01**: Each content type is implemented as an installable Nexus skill
+- [ ] **SKILL-02**: Generalist agent is pre-loaded with a "Creative" skill group
+- [ ] **SKILL-03**: Users can add or remove content type skills through the Skill Aggregator

 ## Future Requirements

-### Voice Enhancements
+Deferred to a future release. Tracked but not in the current roadmap.

-- **VFUT-01**: Wake word detection ("Hey Nexus") for hands-free activation
-- **VFUT-02**: Real-time speech-to-speech streaming (full-duplex WebSocket)
-- **VFUT-03**: Streaming TTS word-by-word playback
+### AI Image Generation

-### Telegram Enhancements
+- **AIGEN-01**: User can generate images via local Stable Diffusion / ComfyUI
+- **AIGEN-02**: User can generate images via cloud APIs (DALL-E, Midjourney)

-- **TFUT-01**: Deep Telegram ↔ web chat session sync via Postgres event bus
-- **TFUT-02**: Rich Telegram elements (inline keyboards, threaded replies)
-- **TFUT-03**: Per-agent Telegram bots
+### Advanced Voice
+
+- **AVOICE-01**: Wake word detection ("Hey Nexus")
+- **AVOICE-02**: Voice call / real-time audio streaming

 ## Out of Scope

 | Feature | Reason |
 |---------|--------|
-| Real-time speech-to-speech | Entirely different architecture (LiveKit/Pipecat); future milestone |
-| Per-agent Telegram bots | Maintenance nightmare; single bot + agent prefix is correct |
-| Deep Telegram ↔ web chat sync | Requires Postgres event bus; deferred to v2.2 Command Center |
-| Telegram inline keyboards/threads | Thin bridge only; rich elements deferred to Command Center |
-| Wake word detection | Always-on mic; hardware device concern; future |
-| Streaming TTS word-by-word | Audio clicks/gaps; sentence-buffered gives 95% of the benefit |
-| Inline code execution over Telegram | Security risk; bridge is relay only |
-| GSD formatting in Telegram | Stateful session tracking; plain text + Markdown v1 only |
-| Transcription editing before sending | Breaks hands-free flow; show transcript in chat bubble after |
+| AI image generation (SD/DALL-E) | Competes with the local LLM for unified GPU memory on M4; cloud APIs send data externally |
+| Social media publishing | API rate limits, auth complexity; generation only for v1.7 |
+| Batch conversion queue | Single-user deployment; one-at-a-time sufficient |
+| Real-time collaborative editing of generated content | Single-user target |
+| Wake word detection | Future consideration |
+| Voice call / real-time audio streaming | Future consideration |

 ## Traceability

+Which phases cover which requirements. Updated during roadmap creation.
+
 | Requirement | Phase | Status |
 |-------------|-------|--------|
-| VPIPE-01 | Phase 36 | Complete |
-| VPIPE-02 | Phase 36 | Complete |
-| VPIPE-03 | Phase 36 | Complete |
-| VPIPE-04 | Phase 36 | Complete |
-| VPIPE-05 | Phase 36 | Complete |
-| VPIPE-06 | Phase 36 | Complete |
-| VPIPE-07 | Phase 39 | Complete |
-| VPIPE-08 | Phase 39 | Complete |
-| WCHAT-01 | Phase 37 | Complete |
-| WCHAT-02 | Phase 37 | Complete |
-| WCHAT-03 | Phase 37 | Complete |
-| WCHAT-04 | Phase 37 | Complete |
-| WCHAT-05 | Phase 37 | Complete |
-| WCHAT-06 | Phase 37 | Complete |
-| TGRAM-01 | Phase 38 | Complete |
-| TGRAM-02 | Phase 38 | Complete |
-| TGRAM-03 | Phase 38 | Complete |
-| TGRAM-04 | Phase 38 | Complete |
-| TGRAM-05 | Phase 38 | Complete |
-| TGRAM-06 | Phase 38 | Complete |
-| ONBRD-01 | Phase 39 | Complete |
-| ONBRD-02 | Phase 39 | Complete |
-| ONBRD-03 | Phase 38 | Complete |
+| — | — | — |

 **Coverage:**
-- v1.6 requirements: 23 total
-- Mapped to phases: 23
-- Unmapped: 0 ✓
+- v1.7 requirements: 54 total
+- Mapped to phases: 0
+- Unmapped: 54

 ---
 *Requirements defined: 2026-04-04*
-*Last updated: 2026-04-03 — traceability populated after roadmap creation*
+*Last updated: 2026-04-04 after initial definition*
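INFRA-01 describes a four-state job lifecycle (queued → running → done/failed). A minimal TypeScript sketch of that lifecycle as an explicit transition table follows. `JobStatus`, `Job`, `canTransition`, and `advance` are hypothetical names, not existing Nexus code, and the sketch assumes `done` and `failed` are terminal and that a job cannot skip `running`.

```typescript
// Hypothetical sketch of the INFRA-01 job lifecycle; not existing Nexus code.
type JobStatus = "queued" | "running" | "done" | "failed";

interface Job {
  id: string;
  status: JobStatus;
}

// Allowed transitions: queued -> running -> done | failed.
// Terminal states (done, failed) have no outgoing transitions.
const TRANSITIONS: Record<JobStatus, JobStatus[]> = {
  queued: ["running"],
  running: ["done", "failed"],
  done: [],
  failed: [],
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns a copy of the job with the new status, or throws on an illegal move.
function advance(job: Job, to: JobStatus): Job {
  if (!canTransition(job.status, to)) {
    throw new Error(`illegal transition ${job.status} -> ${to} for job ${job.id}`);
  }
  return { ...job, status: to };
}

// Usage: a job that completes successfully.
let job: Job = { id: "job-1", status: "queued" };
job = advance(job, "running");
job = advance(job, "done");
console.log(job.status); // done
```

Keeping the transitions as data means an illegal jump (for example `queued` straight to `done`) fails loudly instead of silently leaving a job in an inconsistent state.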
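INFRA-02 and PRES-04 both push job progress over SSE. The sketch below shows how one update could be serialized into a Server-Sent Events frame: a named `event:` field, an `id:` field, one `data:` line per payload line, and a blank line ending the event. The JSON payload with `jobId`, `status`, and `percent` fields and the `toSseFrame` helper are assumptions for illustration.

```typescript
// Hypothetical sketch for INFRA-02 / PRES-04: serialize one progress update as a
// Server-Sent Events frame. The payload shape is an assumption.
interface ProgressUpdate {
  jobId: string;
  status: "queued" | "running" | "done" | "failed";
  percent: number; // 0..100
}

function toSseFrame(update: ProgressUpdate, eventId: number): string {
  const payload = JSON.stringify(update);
  // Each line of the payload needs its own "data:" field; compact JSON has no
  // raw newlines, but splitting keeps the frame valid for any payload.
  const dataLines = payload.split("\n").map((line) => `data: ${line}`);
  // "event:" names the event for EventSource listeners; a blank line ends the frame.
  return ["event: progress", `id: ${eventId}`].concat(dataLines).join("\n") + "\n\n";
}

// Usage: one frame for a job that is 40% done.
const frame = toSseFrame({ jobId: "job-1", status: "running", percent: 40 }, 7);
console.log(frame);
```

On the server, frames like this would be written to a response sent with `Content-Type: text/event-stream`; in the browser, `new EventSource(url)` with `addEventListener("progress", ...)` would receive them. Setting `id:` lets the browser send `Last-Event-ID` when it reconnects, so a client could resume after a dropped connection.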
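THEME-03 requires WCAG AA contrast for every foreground/background pair. A sketch of the WCAG 2.x relative-luminance and contrast-ratio formulas follows. It assumes colors have already been converted from OKLCH to 6-digit sRGB hex; `relativeLuminance`, `contrastRatio`, and `passesAA` are illustrative names.

```typescript
// Sketch of the WCAG 2.x contrast check for THEME-03. Assumes colors are already
// 6-digit sRGB hex (an OKLCH palette would be converted to sRGB first).

// Linearize one 8-bit sRGB channel. WCAG 2.x prints the threshold as 0.03928;
// for 8-bit inputs it classifies every value the same way as 0.04045.
function channelToLinear(c8: number): number {
  const c = c8 / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

function relativeLuminance(hex: string): number {
  const m = /^#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i.exec(hex);
  if (!m) throw new Error(`expected #rrggbb, got ${hex}`);
  const r = channelToLinear(parseInt(m[1], 16));
  const g = channelToLinear(parseInt(m[2], 16));
  const b = channelToLinear(parseInt(m[3], 16));
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Contrast ratio (L_lighter + 0.05) / (L_darker + 0.05), ranging from 1 to 21.
function contrastRatio(a: string, b: string): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// WCAG AA minimums: 4.5:1 for normal text, 3:1 for large text.
function passesAA(fg: string, bg: string, largeText = false): boolean {
  return contrastRatio(fg, bg) >= (largeText ? 3 : 4.5);
}
```

As a sanity check, `#767676` on white comes out at roughly 4.54:1 and passes AA for normal text, while `#777777` lands just under 4.5:1 and fails, though it still clears the 3:1 large-text bar.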
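CONV-09 calls for magic-byte MIME detection. The hand-rolled sketch below covers only a handful of well-known signatures (PNG, JPEG, GIF, PDF, ZIP, WebP) and is an assumption about the approach, not the planned implementation. A maintained detector such as the `file-type` npm package would likely be the safer choice in practice.

```typescript
// Sketch of CONV-09: classify an upload by its leading bytes rather than the
// client-supplied MIME type. Covers only a few well-known signatures.
interface Signature {
  mime: string;
  bytes: number[];
}

const SIGNATURES: Signature[] = [
  { mime: "image/png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { mime: "image/jpeg", bytes: [0xff, 0xd8, 0xff] },
  { mime: "image/gif", bytes: [0x47, 0x49, 0x46, 0x38] }, // "GIF8" (GIF87a, GIF89a)
  { mime: "application/pdf", bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
  // ZIP container; DOCX and XLSX share this prefix and need a look inside.
  { mime: "application/zip", bytes: [0x50, 0x4b, 0x03, 0x04] },
];

// True if `bytes` appear in `buf` starting at `offset`.
function matchesAt(buf: Uint8Array, offset: number, bytes: number[]): boolean {
  if (buf.length < offset + bytes.length) return false;
  for (let i = 0; i < bytes.length; i++) {
    if (buf[offset + i] !== bytes[i]) return false;
  }
  return true;
}

function sniffMime(buf: Uint8Array): string | null {
  // WebP: "RIFF" at offset 0 and "WEBP" at offset 8 (bytes 4..7 hold a size).
  if (matchesAt(buf, 0, [0x52, 0x49, 0x46, 0x46]) && matchesAt(buf, 8, [0x57, 0x45, 0x42, 0x50])) {
    return "image/webp";
  }
  for (const sig of SIGNATURES) {
    if (matchesAt(buf, 0, sig.bytes)) return sig.mime;
  }
  return null; // unknown: reject the upload rather than guess
}

console.log(sniffMime(new Uint8Array([0xff, 0xd8, 0xff, 0xe0]))); // image/jpeg
```

In Node, a `Buffer` from `fs.readFileSync` is a `Uint8Array`, so it can be passed to `sniffMime` directly. DOCX and XLSX both report `application/zip` here, so telling them apart would need a look inside the archive.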
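CONV-07 and CONV-08 together imply a routing step: parse a deep link such as `/convert/png/svg`, then prefer a direct converter detected at startup and otherwise fall through to AI-bridged conversion. A sketch under those assumptions follows; `parseConvertPath`, `chooseRoute`, and the `detected` table are hypothetical, with the table standing in for whatever the startup probe actually reports.

```typescript
// Sketch of CONV-07 / CONV-08 routing. The converter table is a hypothetical
// stand-in for whatever the startup probe actually detects.
type Route = { kind: "direct"; converter: string } | { kind: "ai-bridged" };

// Parse "/convert/<from>/<to>" (optional trailing slash); null if it does not match.
function parseConvertPath(path: string): { from: string; to: string } | null {
  const m = /^\/convert\/([a-z0-9]+)\/([a-z0-9]+)\/?$/i.exec(path);
  return m ? { from: m[1].toLowerCase(), to: m[2].toLowerCase() } : null;
}

// Prefer a direct converter found at startup; otherwise fall through to the AI
// bridge instead of reporting the pair as unsupported.
function chooseRoute(from: string, to: string, available: Record<string, string>): Route {
  const key = `${from}->${to}`;
  return Object.prototype.hasOwnProperty.call(available, key)
    ? { kind: "direct", converter: available[key] }
    : { kind: "ai-bridged" };
}

// Hypothetical probe result, e.g. after confirming sharp, pandoc, and ffmpeg load.
const detected: Record<string, string> = {
  "png->webp": "sharp",
  "md->html": "pandoc",
  "mp4->gif": "ffmpeg",
};

const link = parseConvertPath("/convert/png/svg");
if (link) {
  console.log(chooseRoute(link.from, link.to, detected).kind); // ai-bridged
}
```

With this hypothetical table, `/convert/png/webp` resolves to the direct `sharp` path, while `/convert/png/svg` has no direct entry and routes to the AI bridge rather than being shown as blocked.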