docs: complete project research
parent 0abf30b8c1
commit e4a103cd9b
5 changed files with 1813 additions and 1004 deletions
|
|
@@ -1,507 +1,535 @@
|
|||
# Architecture Research
|
||||
|
||||
**Domain:** Voice Pipeline + Minimal Telegram Bridge (v1.6) — integration with existing Nexus/Paperclip monorepo
|
||||
**Researched:** 2026-04-03
|
||||
**Confidence:** HIGH — based on direct codebase inspection + verified current documentation
|
||||
**Domain:** Content generation integration — Nexus v1.7
|
||||
**Researched:** 2026-04-04
|
||||
**Confidence:** HIGH (based on direct codebase inspection of /opt/nexus)
|
||||
|
||||
---
|
||||
## Standard Architecture
|
||||
|
||||
## System Overview
|
||||
|
||||
v1.6 adds two parallel capability tracks onto the existing monorepo: a transport-agnostic voice pipeline (Whisper STT + Piper TTS) and a disposable Telegram bridge that reuses those pipeline primitives for phone access. The architecture constraint is that no voice or chat logic is Telegram-specific — Telegram is an interchangeable transport layer that calls the same server services as the web UI.
|
||||
### System Overview
|
||||
|
||||
```
|
||||
+-----------------------------------------------------------------------------------+
|
||||
| UI Layer (React/Vite) |
|
||||
| |
|
||||
| +-------------------------------------------------------------------------+ |
|
||||
| | ChatPanel / PersonalAssistant (MODIFIED) | |
|
||||
| | +---------------------+ +--------------------+ +------------------+ | |
|
||||
| | | VoiceMicButton (NEW)| | WaveformDisplay | | TtsButton (v1.5) | | |
|
||||
| | | silence detection | | (NEW) animated bars| | + auto-play prop | | |
|
||||
| | | auto-send on silence| +--------------------+ +------------------+ | |
|
||||
| | +---------------------+ | |
|
||||
| | +-------------------------------------------------------------------+ | |
|
||||
| | | ChatMessage (MODIFIED) — voice_mode badge, dual output toggle | | |
|
||||
| | +-------------------------------------------------------------------+ | |
|
||||
| | +-------------------------------------------------------------------+ | |
|
||||
| | | VoiceModeToggle (NEW) — text only / voice input / full voice | | |
|
||||
| | +-------------------------------------------------------------------+ | |
|
||||
| +-------------------------------------------------------------------------+ |
|
||||
+-----------------------------------------------------------------------------------+
|
||||
| HTTP + SSE
|
||||
+-----------------------------------------------------------------------------------+
|
||||
| Server Layer (Express) |
|
||||
| |
|
||||
| +------------------------------------+ +------------------------------------+ |
|
||||
| | voice.ts (NEW route) | | telegram.ts (NEW route/service) | |
|
||||
| | POST /transcribe (MOVED) | | grammY long-poll process | |
|
||||
| | POST /synthesize (NEW) | | text + voice relay | |
|
||||
| +------------------------------------+ +------------------------------------+ |
|
||||
| | | |
|
||||
| +-----------------v--------------------------------------------v--------------+ |
|
||||
| | voicePipelineService (NEW — core) | |
|
||||
| | transcribe(audioBuffer, format) -> string | |
|
||||
| | synthesize(text, voiceId?) -> Buffer (WAV) | |
|
||||
| | formatForVoice(text) -> { voice: string, full: string } | |
|
||||
| +------------------------------------------------------------------------------+ |
|
||||
| | |
|
||||
| +-----------------v--------------------------------------------------------------+|
|
||||
| | chatService / nexusSettingsService (EXISTING) ||
|
||||
| | conversations . messages . stream SSE . memory . voiceEnabled ||
|
||||
| +--------------------------------------------------------------------------------+|
|
||||
| | |
|
||||
| +-----------------v--------------------------------------------------------------+|
|
||||
| | External Processes (spawned via child_process.spawn / execFile) ||
|
||||
| | whisper-cpp / whisper (STT) piper (TTS) ||
|
||||
| +--------------------------------------------------------------------------------+|
|
||||
+-----------------------------------------------------------------------------------+
|
||||
^
|
||||
| Telegram Bot API (HTTPS long-poll)
|
||||
+--------+------------------------------------------------------------------------+
|
||||
| Telegram (external service) |
|
||||
| User sends text -> bot relays to chatService -> SSE reply -> bot sends back |
|
||||
| User sends voice -> bot downloads OGG -> voicePipelineService.transcribe() |
|
||||
| -> chatService -> reply -> voicePipelineService.synthesize() |
|
||||
| -> bot sends OGG audio reply |
|
||||
+----------------------------------------------------------------------------------+
|
||||
+---------------------------------------------------------------------------------+
|
||||
| UI Layer (React/Vite) |
|
||||
| +------------------+ +------------------+ +--------------+ +----------------+ |
|
||||
| | ChatPanel | | ContentJobViewer | | ThemePreview | | DiagramRenderer| |
|
||||
| | (existing, | | (new) | | (new) | | (new, wraps | |
|
||||
| | minor extension)| | progress+result | | CSS vars | | mermaid dep) | |
|
||||
| +--------+---------+ +--------+---------+ +------+-------+ +-------+--------+ |
|
||||
| | | | | |
|
||||
+-----------|--------------------|--------------------|-----------------|------------+
|
||||
| HTTP/SSE | HTTP/SSE | HTTP | (client-side)
|
||||
+-----------|--------------------|--------------------|-------------------------+
|
||||
| | API Layer (Express) | |
|
||||
| +--------v----------------------------------------------v------------------+ |
|
||||
| | /api/companies/:id/content-jobs (new) | |
|
||||
| | /api/content-jobs/:id (new) | |
|
||||
| | /api/companies/:id/themes/generate (new) | |
|
||||
| +-------------------------------------------------------------------------+ |
|
||||
+---------------------------------------------------------------------------------+
|
||||
| Service Layer (Node.js) |
|
||||
| +-------------------+ +------------------+ +---------------------------------+|
|
||||
| | contentJobService | | themeEngineService| | renderPipelineService ||
|
||||
| | (new) | | (new) | | (new) ||
|
||||
| | enqueue, status, | | palette gen, | | routes jobs to renderer adapters ||
|
||||
| | list | | WCAG check,export | | ||
|
||||
| +--------+----------+ +------------------+ +---------------+-----------------+|
|
||||
| | | |
|
||||
| +--------v----------------------------------------------------v--------------+ |
|
||||
| | Renderer Adapters (new, behind interface) | |
|
||||
| | +-------------+ +------------+ +----------+ +-----------+ +----------+ | |
|
||||
| | | Mermaid | | SVG | | Remotion | | PDF | | Image | | |
|
||||
| | | (isomorphic)| | (generator)| | (CLI) | | (Puppeteer| | (Sharp) | | |
|
||||
| | +-------------+ +------------+ +----------+ +-----------+ +----------+ | |
|
||||
| +-------------------------------------------------------------------------+ |
|
||||
+---------------------------------------------------------------------------------+
|
||||
| Storage + Events Layer (existing, minimally extended) |
|
||||
| +------------------+ +--------------------+ +-----------------------------+ |
|
||||
| | StorageService | | publishLiveEvent | | assets table (existing) | |
|
||||
| | (existing) | | (existing +3 types)| | content_jobs table (new) | |
|
||||
| +------------------+ +--------------------+ +-----------------------------+ |
|
||||
+---------------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
---
|
||||
### Component Responsibilities
|
||||
|
||||
## Integration Points: New vs. Existing
|
||||
|
||||
### What Stays Unchanged
|
||||
|
||||
| Component | Location | Status |
|
||||
|-----------|----------|--------|
|
||||
| `chatService` | `server/src/services/chat.ts` | No changes — voice pipeline uses it as-is |
|
||||
| `nexusSettingsService` | `server/src/services/nexus-settings.ts` | Extend schema only (add `voiceMode`, `telegramToken`) |
|
||||
| `chatFileRoutes` | `server/src/routes/chat-files.ts` | `/transcribe` moves out; file upload stays |
|
||||
| `usePiperTts` | `ui/src/hooks/usePiperTts.ts` | No changes — TtsButton continues using browser WASM |
|
||||
| `TtsButton` | `ui/src/components/TtsButton.tsx` | Add auto-play prop only |
|
||||
| SSE stream endpoint | `server/src/routes/chat.ts` | No changes — Telegram bridge calls services directly |
|
||||
| DB schema | `packages/db` | No changes — voice is file/process, not a DB column |
|
||||
|
||||
### What Changes (MODIFIED)
|
||||
|
||||
| Component | Location | Change |
|
||||
|-----------|----------|--------|
|
||||
| `VoiceRecordButton` | `ui/src/components/VoiceRecordButton.tsx` | Add silence detection, waveform data emission, auto-send on silence |
|
||||
| `ChatInput` | `ui/src/components/ChatInput.tsx` | Wire new VoiceMicButton, add voice mode prop |
|
||||
| `ChatMessage` | `ui/src/components/ChatMessage.tsx` | Show voice_mode badge, show dual output collapse/expand |
|
||||
| `nexusSettingsSchema` | `server/src/services/nexus-settings.ts` | Add `voiceMode` enum and `telegramToken` optional string |
|
||||
| `app.ts` | `server/src/app.ts` | Register `voiceRoutes`, `telegramRoutes` |
|
||||
| `createMessageSchema` | `packages/shared/src/validators/chat.ts` | Add `voiceMode: z.boolean().optional()` flag on messages |
|
||||
| `ChatMessage` type | `packages/shared/src/types/chat.ts` | Add `voiceMode: boolean \| null` field |
|
||||
| `chat-files.ts` | `server/src/routes/chat-files.ts` | Remove `/transcribe` handler (moved to voice.ts) |
|
||||
|
||||
### What Is New (NEW)
|
||||
|
||||
| Component | Location | Purpose |
|
||||
|-----------|----------|---------|
|
||||
| `voicePipelineService` | `server/src/services/voice-pipeline.ts` | Transport-agnostic STT/TTS core — used by web routes AND Telegram bridge |
|
||||
| `voice.ts` (route) | `server/src/routes/voice.ts` | `POST /api/transcribe`, `POST /api/synthesize` — thin HTTP wrappers |
|
||||
| `telegram.ts` (service) | `server/src/services/telegram.ts` | grammY bot init, long-poll loop, message relay, voice relay |
|
||||
| `telegram.ts` (route) | `server/src/routes/telegram.ts` | `GET /api/telegram/status`, `POST /api/telegram/token` management endpoints |
|
||||
| `VoiceMicButton` | `ui/src/components/VoiceMicButton.tsx` | Enhanced mic button with silence detection and waveform display |
|
||||
| `WaveformDisplay` | `ui/src/components/WaveformDisplay.tsx` | Animated audio waveform bars using AnalyserNode |
|
||||
| `VoiceModeToggle` | `ui/src/components/VoiceModeToggle.tsx` | Three-state toggle: text only / voice input / full voice |
|
||||
| `useVoiceMode` | `ui/src/hooks/useVoiceMode.ts` | Reads/writes voice mode setting via `/api/nexus-settings` |
|
||||
| `useSilenceDetection` | `ui/src/hooks/useSilenceDetection.ts` | Web Audio API AnalyserNode watching for 1.5s silence threshold |
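
The `useSilenceDetection` row above describes watching an `AnalyserNode` for 1.5 s of silence. A minimal sketch of that core check, written as a pure function so it can be tested without the Web Audio API, might look like the following. The names `rms` and `createSilenceTracker` and the 0.01 RMS floor are assumptions for illustration, not values taken from the codebase.

```typescript
// Hypothetical helpers; the 0.01 RMS floor is an assumed default.
export interface SilenceTracker {
  /** Feed one frame of time-domain samples (range -1..1) captured at nowMs.
   *  Returns true once the input has stayed below the floor for silenceMs. */
  push(samples: Float32Array, nowMs: number): boolean;
}

/** Root-mean-square level of one frame of samples. */
export function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

export function createSilenceTracker(silenceMs = 1500, rmsFloor = 0.01): SilenceTracker {
  let silentSince: number | null = null;
  return {
    push(samples, nowMs) {
      if (rms(samples) >= rmsFloor) {
        silentSince = null; // any speech resets the silence window
        return false;
      }
      if (silentSince === null) silentSince = nowMs;
      return nowMs - silentSince >= silenceMs;
    },
  };
}
```

In the hook, each frame would presumably come from `AnalyserNode.getFloatTimeDomainData()` on every animation frame, with `performance.now()` as the timestamp.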
|
||||
|
||||
---
|
||||
|
||||
## Component Boundaries
|
||||
|
||||
### voicePipelineService (Core)
|
||||
|
||||
This is the key abstraction for v1.6. Both the web HTTP route and the Telegram bridge call this service — neither knows about the other.
|
||||
|
||||
| Method | Input | Output | Implementation |
|
||||
|--------|-------|--------|----------------|
|
||||
| `transcribe(buffer, format)` | `Buffer`, `"webm" or "ogg"` | `Promise<string>` | Writes temp file, uses `execFile` (not `exec`) to spawn `whisper-cpp` or `whisper` CLI, reads stdout, cleans up |
|
||||
| `synthesize(text, voiceId?)` | `string`, optional voiceId | `Promise<Buffer>` | Spawns `piper` CLI via `spawn`, pipes text to stdin, collects WAV stdout |
|
||||
| `formatForVoice(text)` | `string` | `{ voice: string; full: string }` | Strips code blocks and markdown for voice; returns both variants |
|
||||
|
||||
The `transcribe` method extends the existing `/transcribe` implementation from `chat-files.ts` by adding an `ogg` format path alongside the existing `webm` path. The same cascade (whisper-cpp first, openai-whisper fallback) is preserved.
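
The `formatForVoice` row in the table above says it strips code blocks and Markdown and returns both variants. As a rough sketch only, assuming a regex-based approach is acceptable (the actual implementation may differ), it could be written like this:

```typescript
// Illustrative sketch; the regex set is an assumption, not the codebase's implementation.
export function formatForVoice(text: string): { voice: string; full: string } {
  const voice = text
    .replace(/```[\s\S]*?```/g, " ")           // drop fenced code blocks entirely
    .replace(/`([^`]*)`/g, "$1")               // unwrap inline code
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, "$1")  // images -> alt text
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1")   // links -> link label
    .replace(/^\s{0,3}#{1,6}\s+/gm, "")        // heading markers
    .replace(/^\s*[-*+]\s+/gm, "")             // bullet markers
    .replace(/\*\*|__|~~|\*/g, "")             // emphasis markers (single "_" is left alone)
    .replace(/\s+/g, " ")                      // collapse whitespace for the TTS engine
    .trim();
  return { voice, full: text };
}
```

A regex pass like this is lossy on edge cases (nested emphasis, tables, literal asterisks), which is likely fine for speech output but worth keeping in mind if the `voice` variant is ever displayed.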
|
||||
|
||||
**Why a dedicated service vs. inline in routes:**
|
||||
The Telegram bridge cannot call the web route (circular HTTP call within the same process). Both transports need the same logic. Extracting to a service eliminates duplication and makes both implementations testable in isolation.
|
||||
|
||||
### telegram service
|
||||
|
||||
A thin relay, not a feature-rich bot. It:
|
||||
1. Holds a single grammY `Bot` instance, initialized when `telegramToken` is set in nexus-settings
|
||||
2. Routes text messages to `chatService.addMessage()` then collects AI response via `puterProxyService.chatStream()`
|
||||
3. Routes voice messages — downloads OGG file, calls `voicePipelineService.transcribe()`, then same text path
|
||||
4. If `voiceMode === "full_voice"`: calls `voicePipelineService.synthesize()`, sends audio back via `ctx.replyWithAudio()`
|
||||
5. Prefixes agent name on replies: `[Agent Name]: message text`
|
||||
|
||||
**No per-user conversation tracking.** All Telegram messages go to a single conversation (or create one on first use) associated with the workspace. This is the intentional "thin bridge" design — full sync is out of scope per PROJECT.md.
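
Steps 4 and 5 of the list above (audio reply in full-voice mode, agent-name prefix on text replies) can be sketched as a small pure function that decides what the bot sends back. Only the `"full_voice"` literal appears in this document; the other two `VoiceMode` spellings, the `ReplyPlan` type, and the choice to synthesize the unprefixed text are assumptions.

```typescript
// Only "full_voice" is taken from the document; the other literals are assumed spellings.
type VoiceMode = "text_only" | "voice_input" | "full_voice";

// Hypothetical result type describing what the relay should send.
type ReplyPlan =
  | { kind: "text"; text: string }
  | { kind: "audio"; textToSynthesize: string };

export function planReply(mode: VoiceMode, agentName: string, response: string): ReplyPlan {
  if (mode === "full_voice") {
    // Assumption: the spoken reply omits the "[Agent Name]:" prefix.
    return { kind: "audio", textToSynthesize: response };
  }
  return { kind: "text", text: `[${agentName}]: ${response}` };
}
```

Keeping this decision separate from the grammY handler would let the relay logic be unit-tested without a bot token.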
|
||||
|
||||
### Voice Route vs. Chat Files Route
|
||||
|
||||
The existing `/transcribe` endpoint lives inside `chatFileRoutes` in `chat-files.ts`. For v1.6, the endpoint moves to a dedicated `voice.ts` route. This is a path-preserving refactor: the endpoint behavior is unchanged, but the code now lives in a Nexus-specific file rather than inside a mostly-upstream file.
|
||||
|
||||
Moving the handler reduces merge conflict surface on future upstream rebases of `chat-files.ts`.
|
||||
|
||||
---
|
||||
| Component | Responsibility | Status | Notes |
|
||||
|-----------|----------------|--------|-------|
|
||||
| `contentJobService` | Queue and track async render jobs; emit live events on status change | New | Factory function, matches `chatService` pattern |
|
||||
| `renderPipelineService` | Route render requests to the correct renderer adapter | New | Strategy pattern over adapters |
|
||||
| `themeEngineService` | Palette generation, WCAG AA validation, CSS/JSON/Tailwind exports | New | Pure computation, no DB, deterministic |
|
||||
| `mermaidRendererAdapter` | Mermaid DSL string to SVG buffer, server-side | New | Uses `@mermaid-js/mermaid-isomorphic`; no Chromium needed |
|
||||
| `remotionRendererAdapter` | Invoke Remotion CLI subprocess, return MP4/WebM path | New | Subprocess; outputs go to storage namespace `generated/videos` |
|
||||
| `svgGeneratorAdapter` | Template-based SVG generation (icons, banners, placeholders) | New | No binary deps; pure string construction + existing sanitizer |
|
||||
| `pdfRendererAdapter` | HTML to PDF via Puppeteer (arm64 Chromium on M4) | New | Subprocess; Puppeteer arm64 works on Apple Silicon |
|
||||
| `imageProcessorAdapter` | Composite and resize via Sharp | Modified | Sharp already in `server/package.json`; extend for content use |
|
||||
| `placeholderService` | Manifest tracking for draft assets | Existing | Already implemented; optionally extend PlaceholderEntry with `contentJobId` |
|
||||
| `assetService` | CRUD for the `assets` table | Existing | Already handles `createdByAgentId`; use as-is |
|
||||
| `StorageService` | Provider-agnostic blob storage | Existing | Use `generated/` namespace prefix for all new content |
|
||||
| `publishLiveEvent` | SSE fan-out to UI subscribers | Existing | Extend `LIVE_EVENT_TYPES` with 3 new content job event types |
|
||||
| `ContentJobViewer` (UI) | Poll/stream job status; show progress, render result inline | New | Subscribes to SSE live events |
|
||||
| `DiagramRenderer` (UI) | Client-side Mermaid render using existing `mermaid` dep | New | `mermaid ^11.12.0` already in `ui/package.json` |
|
||||
| `ThemePreview` (UI) | Live palette preview via CSS custom properties | New | No server round-trip for preview |
|
||||
| `ContentGallery` (UI) | Workspace page showing all generated assets | New | Pagination via `assetService.list` |
|
||||
|
||||
## Recommended Project Structure
|
||||
|
||||
New files follow existing monorepo conventions: factory functions, co-located types, no class syntax.
|
||||
|
||||
```
|
||||
server/src/
|
||||
app.ts # MODIFY: register voiceRoutes, telegramRoutes
|
||||
routes/
|
||||
chat-files.ts # MODIFY: remove /transcribe handler (moved to voice.ts)
|
||||
voice.ts # NEW: POST /transcribe, POST /synthesize
|
||||
nexus-settings.ts # MODIFY: expose voiceMode + telegramToken fields
|
||||
telegram.ts # NEW: GET /telegram/status, POST /telegram/token
|
||||
services/
|
||||
voice-pipeline.ts # NEW: transcribe(), synthesize(), formatForVoice()
|
||||
telegram.ts # NEW: grammY bot lifecycle + relay logic
|
||||
nexus-settings.ts # MODIFY: add voiceMode + telegramToken to schema
|
||||
├── services/
|
||||
│ ├── content-job.ts # contentJobService factory
|
||||
│ ├── render-pipeline.ts # renderPipelineService — adapter dispatch
|
||||
│ ├── theme-engine.ts # themeEngineService — pure palette computation
|
||||
│ └── renderers/
|
||||
│ ├── index.ts # RendererAdapter interface + barrel
|
||||
│ ├── mermaid-renderer.ts # Mermaid DSL -> SVG (server-side isomorphic)
|
||||
│ ├── remotion-renderer.ts # Remotion CLI subprocess wrapper
|
||||
│ ├── svg-generator.ts # Template SVG (icons, placeholders, banners)
|
||||
│ └── pdf-renderer.ts # HTML -> PDF via Puppeteer
|
||||
├── routes/
|
||||
│ ├── content-jobs.ts # GET/POST /companies/:id/content-jobs
|
||||
│ └── themes.ts # POST /companies/:id/themes/generate
|
||||
└── types/
|
||||
└── content.ts # Server-internal ContentJobType, ContentJobStatus
|
||||
|
||||
ui/src/
|
||||
components/
|
||||
VoiceMicButton.tsx # NEW: replaces VoiceRecordButton in ChatInput
|
||||
WaveformDisplay.tsx # NEW: animated bars from AnalyserNode data
|
||||
VoiceModeToggle.tsx # NEW: 3-state toggle (text / voice-in / full-voice)
|
||||
VoiceRecordButton.tsx # KEEP as-is (still used in file upload contexts)
|
||||
TtsButton.tsx # MODIFY: add autoPlay prop
|
||||
ChatInput.tsx # MODIFY: add VoiceModeToggle, swap in VoiceMicButton
|
||||
ChatMessage.tsx # MODIFY: voice_mode badge + dual output expand
|
||||
hooks/
|
||||
useVoiceMode.ts # NEW: reads/writes voiceMode setting
|
||||
useSilenceDetection.ts # NEW: AnalyserNode silence threshold
|
||||
usePiperTts.ts # KEEP as-is (browser-side TTS unchanged)
|
||||
packages/db/src/
|
||||
├── schema/
|
||||
│ └── content_jobs.ts # NEW table (upstream-safe, no upstream equivalent)
|
||||
└── migrations/
|
||||
└── NNNN_add_content_jobs.sql
|
||||
|
||||
packages/shared/src/
|
||||
validators/chat.ts # MODIFY: add voiceMode flag to createMessageSchema
|
||||
types/chat.ts # MODIFY: add voiceMode field to ChatMessage
|
||||
├── types/
|
||||
│ └── content.ts # ContentJob, ContentJobStatus shared types
|
||||
└── constants.ts # LIVE_EVENT_TYPES extended (+3 content.job.* types)
|
||||
|
||||
packages/
|
||||
└── remotion-compositions/ # NEW workspace package
|
||||
├── package.json
|
||||
└── src/
|
||||
└── index.ts # Remotion composition definitions
|
||||
|
||||
ui/src/
|
||||
├── components/
|
||||
│ ├── ContentJobViewer.tsx # Job progress + result display
|
||||
│ ├── ContentJobCard.tsx # Compact job status card
|
||||
│ ├── DiagramRenderer.tsx # Mermaid client-side wrapper
|
||||
│ ├── ThemePreview.tsx # Live palette preview
|
||||
│ └── GeneratedAssetCard.tsx # Thumbnail + download + metadata
|
||||
└── pages/
|
||||
└── ContentGallery.tsx # Gallery of generated assets per workspace
|
||||
```
|
||||
|
||||
---
|
||||
### Structure Rationale
|
||||
|
||||
- **`server/src/services/renderers/`**: Isolates binary-dependent adapters behind a shared `RendererAdapter` interface. New renderers plug in without touching the core job service.
|
||||
- **`content_jobs` table**: Separate from `assets`. A job tracks the render lifecycle (queued → running → done or failed); on success it writes an `assets` row and records the `assetId`. This mirrors how `heartbeat_runs` tracks execution separately from its outputs.
|
||||
- **`packages/remotion-compositions/`**: Remotion compositions must be bundled ahead of time. Keeping them in a dedicated workspace package lets the bundle step run once at startup, not on every render request.
|
||||
- **Content as skills**: Skills (`company_skills`) are Markdown instruction files. Content type skills tell agents which `/api/companies/:id/content-jobs` endpoint to call and with what parameters. No new schema needed.
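
The job lifecycle named in the `content_jobs` bullet above can be made explicit with a small transition table. This is a sketch under assumptions: the four status names come from this document, but the table of allowed transitions and the helper names are hypothetical.

```typescript
// Status names come from the document; the allowed-transition table is an assumption.
export type ContentJobStatus = "queued" | "running" | "done" | "failed";

const ALLOWED_TRANSITIONS: Record<ContentJobStatus, readonly ContentJobStatus[]> = {
  queued: ["running", "failed"],
  running: ["done", "failed"],
  done: [],   // terminal
  failed: [], // terminal
};

export function canTransition(from: ContentJobStatus, to: ContentJobStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}

/** Returns the new status, or throws if the move is not allowed. */
export function nextStatus(from: ContentJobStatus, to: ContentJobStatus): ContentJobStatus {
  if (!canTransition(from, to)) {
    throw new Error(`Invalid content job transition: ${from} -> ${to}`);
  }
  return to;
}
```

A table this strict requires an explicit `"running"` update before `"done"`; if the service writes `"done"` straight from `"queued"`, the `queued` entry would need to allow it.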
|
||||
|
||||
## Architectural Patterns
|
||||
|
||||
### Pattern 1: Transport-Agnostic Voice Service
|
||||
### Pattern 1: Async Job with SSE Progress
|
||||
|
||||
**What:** A server service (`voicePipelineService`) owns STT and TTS logic. HTTP routes and Telegram relay both call the service — neither implements STT/TTS directly.
|
||||
**What:** Long-running renders (Remotion, Puppeteer PDF) run asynchronously. The service creates a `content_jobs` row with `status: "queued"`, immediately returns the job record, then spawns the renderer. Live events push progress to the UI over the existing SSE stream.
|
||||
|
||||
**When to use:** Any time two transports (web + bot) need the same capability.
|
||||
**When to use:** Any render taking more than ~200ms: Remotion, PDF, large Mermaid diagrams. Fast operations (SVG generation, theme palette) can be synchronous HTTP.
|
||||
|
||||
**Trade-offs:** Adds one indirection layer. Worth it: eliminates duplication, makes each transport testable independently.
|
||||
**Trade-offs:** One DB row per render. Adds durable history of what was generated. Acceptable at solo-user scale.
|
||||
|
||||
**Shape:**
|
||||
**Example:**
|
||||
```typescript
|
||||
// server/src/services/voice-pipeline.ts
|
||||
export function voicePipelineService() {
|
||||
// Uses execFile (not exec) — prevents shell injection, consistent with codebase pattern
|
||||
async function transcribe(buffer: Buffer, format: "webm" | "ogg"): Promise<string>;
|
||||
async function synthesize(text: string, voiceId?: string): Promise<Buffer>;
|
||||
function formatForVoice(text: string): { voice: string; full: string };
|
||||
return { transcribe, synthesize, formatForVoice };
|
||||
// server/src/services/content-job.ts
|
||||
export function contentJobService(db: Db, storage: StorageService) {
|
||||
return {
|
||||
async enqueue(companyId: string, input: ContentJobInput): Promise<ContentJob> {
|
||||
const [row] = await db
|
||||
.insert(contentJobs)
|
||||
.values({ companyId, type: input.type, params: input.params, status: "queued" })
|
||||
.returning();
|
||||
publishLiveEvent({ companyId, type: "content.job.started", payload: { jobId: row.id } });
|
||||
// Non-blocking — kick off render
|
||||
renderPipelineService(storage).render(row)
|
||||
.then(async (result) => {
|
||||
await db.update(contentJobs)
|
||||
.set({ status: "done", assetId: result.assetId, completedAt: new Date() })
|
||||
.where(eq(contentJobs.id, row.id));
|
||||
publishLiveEvent({ companyId, type: "content.job.done", payload: { jobId: row.id, assetId: result.assetId } });
|
||||
})
|
||||
.catch(async (err) => {
|
||||
await db.update(contentJobs)
|
||||
.set({ status: "failed", errorMessage: String(err) })
|
||||
.where(eq(contentJobs.id, row.id));
|
||||
publishLiveEvent({ companyId, type: "content.job.failed", payload: { jobId: row.id } });
|
||||
});
|
||||
return toContentJob(row);
|
||||
},
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
The existing `/transcribe` handler in `chat-files.ts` already uses `promisify(execFile)` — this pattern is the right model. The service wraps it with format selection (`webm` vs `ogg`) and the same whisper-cpp → openai-whisper cascade.
|
||||
### Pattern 2: RendererAdapter Interface
|
||||
|
||||
### Pattern 2: Thin Telegram Relay
|
||||
**What:** Each renderer implements a shared interface. `renderPipelineService` selects the adapter based on `ContentJobType`. Adding a new renderer requires only: (a) implement the interface, (b) register in the dispatch table.
|
||||
|
||||
**What:** The Telegram bot is a relay, not a first-class UI. It translates Telegram message events into the same chatService calls the web UI makes, then sends the response back via Telegram.
|
||||
**When to use:** Every new content type.
|
||||
|
||||
**When to use:** Building a disposable bridge that will be replaced by a richer implementation later.
|
||||
**Trade-offs:** Thin abstraction, no framework needed. Appropriate for the codebase size and single-user scale.
|
||||
|
||||
**Trade-offs:** No rich UI (no inline keyboards, no threading). Acceptable: PROJECT.md explicitly calls out "thin bridge only" and "Telegram threads/topics/inline keyboards" are out of scope.
|
||||
|
||||
**Shape:**
|
||||
**Example:**
|
||||
```typescript
|
||||
// server/src/services/telegram.ts
|
||||
import { Bot } from "grammy";
|
||||
|
||||
export function telegramService(db: Db) {
|
||||
let bot: Bot | null = null;
|
||||
|
||||
function start(token: string): void; // idempotent, long-poll
|
||||
function stop(): void;
|
||||
function isRunning(): boolean;
|
||||
|
||||
return { start, stop, isRunning };
|
||||
// server/src/services/renderers/index.ts
|
||||
export interface RendererAdapter {
|
||||
type: ContentJobType;
|
||||
render(
|
||||
params: Record<string, unknown>,
|
||||
storage: StorageService,
|
||||
companyId: string
|
||||
): Promise<{ objectKey: string; contentType: string; byteSize: number }>;
|
||||
}
|
||||
```
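
The "register in the dispatch table" step of the RendererAdapter pattern could be sketched as follows. To stay self-contained, this version uses simplified stand-ins: only the `"mermaid"` type literal appears in this document, the other `ContentJobType` members are guesses, the `storage` parameter is omitted, and `createRendererRegistry` is a hypothetical name.

```typescript
// Simplified stand-ins so the sketch compiles on its own; only "mermaid" is from the document.
type ContentJobType = "mermaid" | "svg" | "remotion" | "pdf" | "image";

interface RenderResult { objectKey: string; contentType: string; byteSize: number }

interface RendererAdapter {
  type: ContentJobType;
  // The storage parameter from the real interface is left out for brevity.
  render(params: Record<string, unknown>, companyId: string): Promise<RenderResult>;
}

export function createRendererRegistry(adapters: RendererAdapter[]) {
  const byType = new Map<ContentJobType, RendererAdapter>();
  for (const adapter of adapters) {
    if (byType.has(adapter.type)) {
      throw new Error(`Duplicate renderer for type "${adapter.type}"`);
    }
    byType.set(adapter.type, adapter);
  }
  return {
    /** Look up the adapter for a job type, failing loudly if none is registered. */
    resolve(type: ContentJobType): RendererAdapter {
      const adapter = byType.get(type);
      if (!adapter) throw new Error(`No renderer registered for type "${type}"`);
      return adapter;
    },
    types(): ContentJobType[] {
      return Array.from(byType.keys());
    },
  };
}
```

With a registry like this, adding a renderer means passing one more adapter to the factory; the job service itself would not change.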
|
||||
|
||||
The bot calls `chatService(db)` and `puterProxyService(db)` directly — no HTTP round-trip to the same server.
|
||||
### Pattern 3: Content Types as Skill Files
|
||||
|
||||
### Pattern 3: Voice Mode Flag on Messages
|
||||
**What:** A Mermaid-generation "skill" is a Markdown file in `company_skills` that instructs agents: "When asked for a diagram, call `POST /api/companies/:id/content-jobs` with `{type: 'mermaid', params: {dsl: '...'}}` and wait for `content.job.done` event." No new schema required.
|
||||
|
||||
**What:** Each message carries an optional `voiceMode: boolean` flag. When `true`, the server formats the response for voice (dual output: `voice` + `full`), and the client auto-plays TTS and shows the full text in a collapsible block.
|
||||
**When to use:** All content types — this is how they become installable skills.
|
||||
|
||||
**When to use:** Differentiating voice-initiated messages from text messages within the same conversation.
|
||||
**Trade-offs:** The agent must know the API contract, which is spelled out in the skill Markdown. This works with all adapters (Claude Code, Hermes, Ollama) because skills are plain text.
|
||||
|
||||
**Trade-offs:** Adds a field to `createMessageSchema` and the `ChatMessage` type. The field is optional and is treated as `false` when absent, so existing messages and the upstream schema are not broken.
|
||||
### Pattern 4: Theme Engine as Pure Function, Preview Client-Side
|
||||
|
||||
**Schema change:**
|
||||
```typescript
|
||||
// packages/shared/src/validators/chat.ts — additive only
|
||||
export const createMessageSchema = z.object({
|
||||
role: z.enum(["user", "assistant", "system"]),
|
||||
content: z.string().min(1).max(100_000),
|
||||
agentId: z.string().uuid().optional(),
|
||||
messageType: z.string().optional(),
|
||||
voiceMode: z.boolean().optional(), // NEW in v1.6
|
||||
});
|
||||
```
|
||||
**What:** Theme generation is a pure computation: seed hex color in, palette object out. Preview injects CSS custom properties directly into the DOM — no server round-trip. Saving a theme stores the palette JSON via `StorageService`.
|
||||
|
||||
### Pattern 4: Direct Service Calls in Telegram Bridge
|
||||
**When to use:** Theme generation and live preview.
|
||||
|
||||
**What:** The Telegram bot does not call the Express HTTP API to get AI responses. It calls `chatService(db)` and `puterProxyService(db)` as regular TypeScript function calls within the same server process.
|
||||
|
||||
**When to use:** Any time a server-side integration needs the same AI response capability as the web UI without an HTTP round-trip.
|
||||
|
||||
**Trade-offs:** Telegram handler and web handler share the same in-process service instances. If chatService has connection pooling issues, both paths are affected. This is acceptable — single-user deployment, same DB connection pool.
|
||||
|
||||
**Why not HTTP:** A `fetch("http://localhost:PORT/api/...")` call from within the same server requires auth token injection and port discovery, and it creates circular request chains that are hard to test and fragile in development.
|
||||
|
||||
### Pattern 5: grammY Long-Poll for Single-User Local Deployment
|
||||
|
||||
**What:** Use grammY `bot.start()` (long polling) rather than webhooks. The bot polls Telegram for new messages continuously while the server is running.
|
||||
|
||||
**When to use:** Local single-user deployments where a public HTTPS endpoint is not available. No reverse proxy needed, no SSL cert, no domain.
|
||||
|
||||
**Trade-offs:** Long polling is slightly less efficient than webhooks (Telegram must respond to each poll request) but functionally equivalent for <5,000 messages/hour. Fine for personal use.
|
||||
|
||||
**Lifecycle:**
|
||||
- Start: `nexusSettingsService().get()` finds `telegramToken` set → `telegramService(db).start(token)`
|
||||
- Stop: `server.close()` → `telegramService(db).stop()`
|
||||
- Runtime toggle: `POST /api/telegram/token` updates nexus-settings and calls start/stop
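
The start/stop/runtime-toggle lifecycle above can be sketched as an idempotent wrapper around the bot instance. To keep the example self-contained, the bot is injected through a hypothetical `createBot` factory and described by a minimal `BotLike` interface; with grammY the factory would presumably be `(token) => new Bot(token)`, whose `start()` begins long polling.

```typescript
// BotLike covers only the two methods this sketch needs; createBot is a hypothetical test seam.
interface BotLike {
  start(): unknown;
  stop(): unknown;
}

export function telegramLifecycle(createBot: (token: string) => BotLike) {
  let bot: BotLike | null = null;
  let activeToken: string | null = null;

  function stop(): void {
    if (!bot) return; // already stopped
    bot.stop();
    bot = null;
    activeToken = null;
  }

  function start(token: string): void {
    if (bot && activeToken === token) return; // idempotent for the same token
    stop(); // token changed at runtime: tear down the old bot first
    bot = createBot(token);
    activeToken = token;
    // Not awaited: long polling runs until stop(). Real code should also handle a rejected start.
    bot.start();
  }

  function isRunning(): boolean {
    return bot !== null;
  }

  return { start, stop, isRunning };
}
```

The runtime toggle endpoint would then call `start(newToken)` when a token is saved and `stop()` when it is cleared.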
|
||||
|
||||
---
|
||||
**Trade-offs:** No server latency for preview feedback. The JSON is reusable for all export formats (CSS, Tailwind config, design tokens).
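
The WCAG AA validation that the theme engine performs is a pure computation and can be sketched directly from the WCAG 2.x definitions of relative luminance and contrast ratio. The function names below are hypothetical; the formula constants and the 4.5:1 (normal text) and 3:1 (large text) thresholds are from the WCAG specification.

```typescript
// Function names are illustrative; formulas follow the WCAG 2.x definitions.
export function parseHex(hex: string): [number, number, number] {
  const m = /^#?([0-9a-f]{6})$/i.exec(hex.trim());
  if (!m) throw new Error(`Expected a 6-digit hex color, got "${hex}"`);
  const n = parseInt(m[1], 16);
  return [(n >> 16) & 0xff, (n >> 8) & 0xff, n & 0xff];
}

/** Relative luminance of an sRGB color, 0 (black) to 1 (white). */
export function relativeLuminance(hex: string): number {
  const [r, g, b] = parseHex(hex).map((channel) => {
    const c = channel / 255;
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  });
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

/** Contrast ratio between two colors, from 1 to 21. */
export function contrastRatio(a: string, b: string): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  const lighter = Math.max(la, lb);
  const darker = Math.min(la, lb);
  return (lighter + 0.05) / (darker + 0.05);
}

/** AA requires 4.5:1 for normal text and 3:1 for large text. */
export function meetsWcagAA(foreground: string, background: string, largeText = false): boolean {
  return contrastRatio(foreground, background) >= (largeText ? 3 : 4.5);
}
```

Because these functions have no dependencies, the same code could run in `themeEngineService` on the server and in `ThemePreview` on the client.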
|
||||
|
||||
## Data Flow
|
||||
|
||||
### Web Voice Input Flow
|
||||
### Async Content Job Request Flow
|
||||
|
||||
```
|
||||
User holds mic button
|
||||
Agent / UI
|
||||
|
|
||||
v
|
||||
VoiceMicButton: MediaRecorder + AnalyserNode
|
||||
|
|
||||
v (silence detected after 1.5s or stop pressed)
|
||||
POST /api/transcribe {audio: webm blob}
|
||||
POST /api/companies/:id/content-jobs
|
||||
{ type: "mermaid", params: { dsl: "graph TD..." } }
|
||||
|
|
||||
v
|
||||
voice.ts route -> voicePipelineService.transcribe(buffer, "webm")
|
||||
contentJobService.enqueue()
|
||||
  INSERT content_jobs (status = "queued")
|
||||
publishLiveEvent("content.job.started")
|
||||
(non-blocking) -> renderPipelineService.render()
|
||||
|
|
||||
v (whisper-cpp or openai-whisper CLI via execFile)
|
||||
{ text: "transcribed text" }
|
||||
v (async)
|
||||
renderPipelineService
|
||||
selects MermaidRendererAdapter
|
||||
|
|
||||
v
|
||||
ChatInput fills textarea -> user sends (message tagged voiceMode: true)
|
||||
MermaidRendererAdapter.render()
|
||||
Mermaid DSL -> SVG Buffer (via @mermaid-js/mermaid-isomorphic)
|
||||
|
|
||||
v
|
||||
POST /conversations/:id/stream -> chatService + puterProxyService
|
||||
|
|
||||
v (SSE tokens arrive)
|
||||
ChatMessage with voice_mode badge + dual output (voice text + full text collapsible)
|
||||
StorageService.putFile()
|
||||
objectKey: "{companyId}/generated/diagrams/2026/04/04/{uuid}-diagram.svg"
|
||||
|
|
||||
v
|
||||
TtsButton auto-plays (browser-side piper-tts-web WASM — unchanged from v1.5)
|
||||
assetService.create()
|
||||
INSERT assets (objectKey, contentType, byteSize, createdByAgentId)
|
||||
|
|
||||
v
|
||||
contentJobService callback
|
||||
UPDATE content_jobs SET status = "done", asset_id = ...
|
||||
publishLiveEvent("content.job.done", { jobId, assetId })
|
||||
|
|
||||
v
|
||||
UI (SSE subscriber)
|
||||
ContentJobViewer receives event -> fetches asset URL -> renders preview inline
|
||||
```
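
The object key shown in the flow above (`{companyId}/generated/diagrams/2026/04/04/{uuid}-diagram.svg`) suggests a date-partitioned layout under the `generated/` namespace. A minimal sketch of a key builder follows; the function name, its option names, and the use of UTC for the date segments are assumptions.

```typescript
// Hypothetical helper; using UTC for the date path is an assumption.
export interface GeneratedKeyOptions {
  companyId: string;
  category: string; // e.g. "diagrams" or "videos"
  id: string;       // unique id, such as a UUID generated by the caller
  fileName: string; // e.g. "diagram.svg"
  date: Date;
}

function pad2(n: number): string {
  return (n < 10 ? "0" : "") + String(n);
}

export function buildGeneratedObjectKey(opts: GeneratedKeyOptions): string {
  const yyyy = String(opts.date.getUTCFullYear());
  const mm = pad2(opts.date.getUTCMonth() + 1); // getUTCMonth() is zero-based
  const dd = pad2(opts.date.getUTCDate());
  return `${opts.companyId}/generated/${opts.category}/${yyyy}/${mm}/${dd}/${opts.id}-${opts.fileName}`;
}
```

Centralizing key construction in one function would keep every renderer adapter writing under the same prefix.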
|
||||
|
||||
### Server-Side TTS Flow (POST /synthesize)

```
POST /api/synthesize { text, voiceId? }
    |
    v
voice.ts route -> voicePipelineService.synthesize(text)
    |
    v (piper CLI via spawn: text -> stdin, WAV bytes <- stdout)
Response: Content-Type audio/wav, Buffer body
    |
    v
Client: new Audio(URL.createObjectURL(blob)).play()
```

Note: Server-side `/synthesize` is new in v1.6. Its primary consumer is the Telegram bridge (which cannot use browser WASM). Web chat continues using browser-side `usePiperTts` WASM (v1.5 unchanged). The route is available for headless/server scenarios going forward.

### Theme Generation Flow (Synchronous)

```
User picks seed color
    |
    v
ThemePreview component
  (no server round-trip; CSS custom properties injected directly into DOM)
    |
    v (on "Save Theme")
POST /api/companies/:id/themes/generate
  { seedColor: "#4a90d9" }
    |
    v
themeEngineService.generate()
  Compute palette (tints, shades, semantic tokens, WCAG AA checks)
  Returns palette JSON
    |
    v
StorageService.putFile()
  objectKey: "{companyId}/themes/{uuid}-theme.json"
  contentType: "application/json"
    |
    v
assetService.create() -> 201 { assetId, downloadUrl }
```
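The theme generation flow computes tints, shades, and WCAG AA checks from a seed color. A minimal sketch of that computation is shown here, assuming a 6-digit hex seed and the WCAG 2.x contrast formula; `generatePalette`, `mix`, and the step values are hypothetical and not taken from `themeEngineService`.

```typescript
// Parses "#rrggbb" into its three 0-255 channels.
function hexToRgb(hex: string): [number, number, number] {
  const m = /^#?([0-9a-f]{6})$/i.exec(hex);
  if (!m) throw new Error(`Expected a 6-digit hex color, got "${hex}"`);
  const n = parseInt(m[1], 16);
  return [(n >> 16) & 0xff, (n >> 8) & 0xff, n & 0xff];
}

// WCAG 2.x relative luminance of an sRGB color.
function relativeLuminance(hex: string): number {
  const [r, g, b] = hexToRgb(hex).map((c) => {
    const s = c / 255;
    return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
  });
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// WCAG contrast ratio, always >= 1.
function contrastRatio(a: string, b: string): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// Linear blend of `hex` toward `target`; amount 0 keeps hex, 1 yields target.
function mix(hex: string, target: string, amount: number): string {
  const from = hexToRgb(hex);
  const to = hexToRgb(target);
  const channels = from.map((c, i) => Math.round(c + (to[i] - c) * amount));
  return "#" + channels.map((c) => c.toString(16).padStart(2, "0")).join("");
}

interface Palette {
  seed: string;
  tints: string[];
  shades: string[];
  onSeed: string;          // text color with the better contrast on the seed
  onSeedContrast: number;
  onSeedMeetsAA: boolean;  // WCAG AA for normal text requires >= 4.5
}

function generatePalette(seed: string): Palette {
  const steps = [0.2, 0.4, 0.6, 0.8]; // hypothetical step values
  const white = "#ffffff";
  const black = "#000000";
  const onWhite = contrastRatio(seed, white);
  const onBlack = contrastRatio(seed, black);
  const onSeedContrast = Math.max(onWhite, onBlack);
  return {
    seed,
    tints: steps.map((s) => mix(seed, white, s)),
    shades: steps.map((s) => mix(seed, black, s)),
    onSeed: onWhite >= onBlack ? white : black,
    onSeedContrast,
    onSeedMeetsAA: onSeedContrast >= 4.5,
  };
}
```

Because this is pure computation, the same function could in principle back both the live preview and the save endpoint, which matches the synchronous design described in the flow.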
### Telegram Text Message Flow

```
Telegram user sends text
    |
    v
grammY bot.on("message:text") handler
    |
    v
telegramService: resolveOrCreateConversation(db)
    |
    v
chatService(db).addMessage(conversationId, { role: "user", content: text })
    |
    v
telegramService: collect full response via puterProxyService(db).chatStream()
    |
    v (if voiceMode !== "full_voice")
ctx.reply("[AgentName]: full_response_text")

    | (if voiceMode === "full_voice")
    v
voicePipelineService.formatForVoice(response) -> { voice, full }
ctx.reply("[AgentName]: " + full)  -- text message with full details
    |
    v
voicePipelineService.synthesize(voice) -> WAV Buffer
ctx.replyWithAudio(InputFile(wavBuffer, "reply.ogg"))
```

### Mermaid Client-Side Fast Path

```
Agent sends message with ```mermaid code block
    |
    v
ChatMarkdownMessage (existing component, minor extension)
  Detects ```mermaid fence
    |
    v
DiagramRenderer (new component, wraps existing mermaid dep)
  Calls mermaid.render() client-side (mermaid ^11.12 already in ui/package.json)
  Displays SVG inline
  "Save as asset" button -> POST /api/companies/:id/content-jobs (server path)
```
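The fence detection step in the Mermaid client-side fast path could look roughly like this. It is a sketch under assumptions: `splitMermaidFences` and the `Segment` type are hypothetical names, and the real `ChatMarkdownMessage` may rely on its Markdown parser instead of a regular expression.

```typescript
type Segment =
  | { kind: "markdown"; text: string }
  | { kind: "mermaid"; dsl: string };

// Built at runtime so this example does not contain a literal fence marker.
const TICKS = "`".repeat(3);

// Splits a chat message into plain Markdown and Mermaid DSL segments.
function splitMermaidFences(markdown: string): Segment[] {
  const fence = new RegExp(
    "^" + TICKS + "mermaid[ \\t]*\\r?\\n([\\s\\S]*?)\\r?\\n" + TICKS + "[ \\t]*$",
    "gm",
  );
  const segments: Segment[] = [];
  let last = 0;
  let match: RegExpExecArray | null;
  while ((match = fence.exec(markdown)) !== null) {
    if (match.index > last) {
      segments.push({ kind: "markdown", text: markdown.slice(last, match.index) });
    }
    segments.push({ kind: "mermaid", dsl: match[1] });
    last = match.index + match[0].length;
  }
  if (last < markdown.length) {
    segments.push({ kind: "markdown", text: markdown.slice(last) });
  }
  return segments;
}
```

Each `mermaid` segment would then be handed to the diagram component, while `markdown` segments continue through the existing renderer unchanged.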
### Telegram Voice Message Flow

```
Telegram user sends voice note (OGG Opus format)
    |
    v
grammY bot.on("message:voice") -> ctx.getFile() -> download Buffer
    |
    v
voicePipelineService.transcribe(buffer, "ogg") -> whisper CLI -> text
    |
    v
(same path as Telegram text message above)
```

### State Transitions: content_jobs

```
queued  -> running   (renderPipelineService picks up job)
running -> done      (renderer returns, asset created)
running -> failed    (renderer throws, error recorded)
```
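The `content_jobs` state transitions can be enforced with a small guard. The sketch below assumes the four statuses named in this document; `canTransition` and `transition` are hypothetical helpers, not existing functions.

```typescript
type ContentJobStatus = "queued" | "running" | "done" | "failed";

// Allowed next states for each status; "done" and "failed" are terminal.
const ALLOWED_TRANSITIONS: Record<ContentJobStatus, readonly ContentJobStatus[]> = {
  queued: ["running"],
  running: ["done", "failed"],
  done: [],
  failed: [],
};

function canTransition(from: ContentJobStatus, to: ContentJobStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// Returns the new status or throws if the move is not in the table above.
function transition(from: ContentJobStatus, to: ContentJobStatus): ContentJobStatus {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal content job transition: ${from} -> ${to}`);
  }
  return to;
}
```

Keeping the table in one place means a retry policy (for example `failed -> queued`) would be a one-line change if it is ever wanted; the document does not currently define one.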
### nexus-settings Schema Evolution

```
v1.5: { mode, voiceEnabled }
v1.6: { mode, voiceEnabled, voiceMode, telegramToken }

voiceMode: "text" | "voice_input" | "full_voice"  (default: "text")
telegramToken: string | undefined  (set by user via UI or POST /telegram/token)
```

`voiceMode` is a workspace-level setting (not per-agent). The three states map to:
- `"text"`: mic button transcribes to text input, TTS manual-only, Telegram text-only
- `"voice_input"`: mic transcribes and auto-sends, TTS manual-only, Telegram voice-in + text-out
- `"full_voice"`: mic auto-sends, TTS auto-plays on every response, Telegram voice-in + voice-out

---

## New vs Modified: Explicit Breakdown

### New (does not exist)

| Artifact | Type | Purpose |
|----------|------|---------|
| `content_jobs` table | DB schema + migration | Track async render job lifecycle |
| `contentJobService` | Server service | Enqueue, status, list jobs |
| `renderPipelineService` | Server service | Route jobs to renderer adapters |
| `themeEngineService` | Server service | Palette generation + WCAG validation |
| `mermaidRendererAdapter` | Server renderer | Server-side Mermaid to SVG |
| `remotionRendererAdapter` | Server renderer | Remotion CLI to MP4/WebM |
| `svgGeneratorAdapter` | Server renderer | Template SVG generation |
| `pdfRendererAdapter` | Server renderer | HTML to PDF via Puppeteer |
| `content-jobs.ts` route | API route | Create and list content jobs |
| `themes.ts` route | API route | Synchronous theme generation |
| `packages/remotion-compositions/` | Workspace package | Remotion composition definitions |
| `packages/shared/src/types/content.ts` | Shared type | `ContentJob`, `ContentJobStatus` |
| `ContentJobViewer` | UI component | Job progress + result display |
| `DiagramRenderer` | UI component | Client-side Mermaid wrapper |
| `ThemePreview` | UI component | Live palette preview |
| `GeneratedAssetCard` | UI component | Asset thumbnail + actions |
| `ContentGallery` | UI page | Workspace content library |

### Modified (exists, needs extension)

| Artifact | Change | Risk |
|----------|--------|------|
| `packages/shared/src/constants.ts` | Add 3 new `LIVE_EVENT_TYPES`: `content.job.started`, `content.job.done`, `content.job.failed` | LOW (additive only) |
| `server/src/app.ts` | Mount `contentJobRoutes` and `themeRoutes` | LOW (two lines) |
| `ChatMarkdownMessage` | Detect triple-backtick mermaid fence; render via `DiagramRenderer` | MEDIUM (existing component, test carefully) |
| `assetService` | Add `list(companyId, opts)` method for gallery pagination | LOW (new method, no schema change) |
| `packages/db/src/schema/index.ts` | Export `contentJobs` table | LOW |
| `packages/db/src/index.ts` | Export `contentJobs` from db package | LOW |

## Data Model

### New Table: `content_jobs`

No changes to existing tables. Standalone new table; upstream-safe because Paperclip has no content generation system.

```sql
CREATE TABLE content_jobs (
  id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  company_id          UUID NOT NULL REFERENCES companies(id),
  type                TEXT NOT NULL,                   -- 'mermaid' | 'remotion' | 'pdf' | 'svg' | 'theme' | 'image'
  status              TEXT NOT NULL DEFAULT 'queued',  -- 'queued' | 'running' | 'done' | 'failed'
  params              JSONB NOT NULL DEFAULT '{}',     -- renderer-specific input params
  asset_id            UUID REFERENCES assets(id),      -- set on success
  error_message       TEXT,                            -- set on failure
  created_by_agent_id UUID REFERENCES agents(id),
  created_by_user_id  TEXT,
  started_at          TIMESTAMPTZ,
  completed_at        TIMESTAMPTZ,
  created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX content_jobs_company_status_idx ON content_jobs(company_id, status);
CREATE INDEX content_jobs_company_created_idx ON content_jobs(company_id, created_at DESC);
```

### Storage Namespaces (extends existing StorageService path conventions)

```
{companyId}/generated/diagrams/YYYY/MM/DD/{uuid}-{name}.svg
{companyId}/generated/videos/YYYY/MM/DD/{uuid}-{name}.mp4
{companyId}/generated/pdfs/YYYY/MM/DD/{uuid}-{name}.pdf
{companyId}/generated/images/YYYY/MM/DD/{uuid}-{name}.png
{companyId}/generated/icons/YYYY/MM/DD/{uuid}-{name}.svg
{companyId}/themes/{uuid}-theme.json
{companyId}/placeholders/{uuid}-placeholder.svg
```
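To make the dated `generated/` key layout concrete, here is a small sketch that assembles such a key. It only illustrates the naming convention: `buildGeneratedObjectKey` and its input shape are hypothetical, and the existing storage service is described in this document as already handling path construction.

```typescript
type GeneratedCategory = "diagrams" | "videos" | "pdfs" | "images" | "icons";

interface GeneratedKeyInput {
  companyId: string;
  category: GeneratedCategory;
  uuid: string;
  name: string;       // human-readable label, sanitized below
  extension: string;  // without the leading dot, e.g. "svg"
  createdAt: Date;
}

// Builds "{companyId}/generated/{category}/YYYY/MM/DD/{uuid}-{name}.{ext}" using UTC dates.
function buildGeneratedObjectKey(input: GeneratedKeyInput): string {
  const d = input.createdAt;
  const yyyy = String(d.getUTCFullYear());
  const mm = String(d.getUTCMonth() + 1).padStart(2, "0");
  const dd = String(d.getUTCDate()).padStart(2, "0");
  // Keep keys URL-safe: lowercase, alphanumerics and single hyphens only.
  const safeName =
    input.name.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "") || "file";
  return [
    input.companyId,
    "generated",
    input.category,
    yyyy,
    mm,
    dd,
    `${input.uuid}-${safeName}.${input.extension}`,
  ].join("/");
}
```

Using UTC for the date segments is an assumption made here so that keys do not depend on the server's local time zone.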
## Scaling Considerations

This system targets a single user on Mac Mini M4 throughout its lifetime. Scaling is not a concern. The architecture is optimized for simplicity and upstream merge compatibility.

| Concern | At 1 user (target) | Notes |
|---------|-------------------|-------|
| STT latency | whisper-cpp base.en on M4: ~1-3s | Acceptable; shows transcribing spinner |
| TTS latency | piper CLI on M4: ~0.3-1s for short text | <3s target met |
| Telegram poll | grammY `bot.start()`, 1 process | Adequate for <5,000 msgs/hour |
| Memory overhead | ~10-20MB for polling loop | Acceptable on 16GB+ M4 |
| Piper model | First server-side synthesize: cold start | Piper loads model into memory; subsequent calls fast |

---

## Anti-Patterns

### Anti-Pattern 1: Telegram-Specific Voice Logic

**What people do:** Implement OGG-to-text and text-to-OGG directly inside the Telegram bot handler.

**Why it's wrong:** Creates two separate STT/TTS code paths that diverge over time. Voice bugs must be fixed in two places. Untestable in isolation.

**Do this instead:** All voice processing goes through `voicePipelineService`. The Telegram handler calls `transcribe(buf, "ogg")` and the service handles format differences. The web route calls `transcribe(buf, "webm")`: same service, different format argument.

### Anti-Pattern 2: Circular HTTP Call for Telegram AI Response

**What people do:** Telegram bot handler calls `fetch("http://localhost:PORT/api/conversations/:id/stream")` to get AI responses from within the same server process.

**Why it's wrong:** Requires auth token injection. Fragile (port discovery). Extra TCP round-trip. Fails in test environments where the HTTP server may not be running.

**Do this instead:** `telegramService` imports `chatService(db)` and `puterProxyService(db)` directly. Collect tokens from the async generator into a string, then send to Telegram as a single message.

### Anti-Pattern 3: Blocking grammY on Slow CLI Processes

**What people do:** `await synthesize()` inside a bot handler with no timeout, assuming piper is always available and fast.

**Why it's wrong:** If the `piper` binary is not installed or hangs, the grammY update queue stalls. The same update gets retried indefinitely.

**Do this instead:** Wrap CLI calls in a `Promise.race([piperCall, timeout(8_000)])`. If piper times out or is not installed, fall back to text-only reply and log the failure. Bot degrades gracefully to text mode.
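The timeout-and-fallback advice in Anti-Pattern 3 can be sketched as follows. The helper names `withTimeout` and `replyWithOptionalAudio` are hypothetical, and `synthesize` is passed in as a parameter so the example does not assume anything about the real voice service.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    // Clear the timer so a fast result does not keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  });
}

interface Reply {
  text: string;
  audio?: Uint8Array;
}

// Tries to synthesize audio; on timeout or failure, degrades to a text-only reply.
async function replyWithOptionalAudio(
  text: string,
  synthesize: (text: string) => Promise<Uint8Array>,
  timeoutMs = 8_000,
): Promise<Reply> {
  try {
    const audio = await withTimeout(synthesize(text), timeoutMs);
    return { text, audio };
  } catch (err) {
    console.warn("TTS unavailable, replying with text only:", err);
    return { text };
  }
}
```

Note that `Promise.race` only stops waiting; it does not terminate a hung child process. A real implementation would probably also kill the spawned process when the timeout fires.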
### Anti-Pattern 4: Keeping /transcribe Inside chat-files.ts

**What people do:** Leave the STT handler in `chat-files.ts` and call `voicePipelineService` from there, adding Nexus-specific logic to an upstream-sourced file.

**Why it's wrong:** `chat-files.ts` is a mostly-upstream Paperclip file. Each rebase introduces merge conflicts. More Nexus-specific code in the file = more conflict surface.

**Do this instead:** Move `/transcribe` and `/synthesize` to a new `voice.ts` route file (Nexus-only, never in upstream). Keep `chat-files.ts` as close to upstream as possible.

### Anti-Pattern 5: Storing Telegram Token in Database

**What people do:** Create a new DB table or add a column to `instance_settings` to store the Telegram bot token.

**Why it's wrong:** Any DB schema change blocks upstream rebase (migration files conflict). The `nexus-settings.json` file-backed service is the established Nexus pattern for project-specific config that has no upstream equivalent.

**Do this instead:** Store `telegramToken` in `nexus-settings.json` via the existing `nexusSettingsService`. Same pattern as `voiceEnabled`, `mode`.

---

The `StorageService.putFile()` method already handles path construction from `namespace` + `originalFilename` + timestamp. Pass `namespace: "generated/diagrams"` etc.
## Integration Points

### External Services

| Service | Integration Pattern | Notes |
|---------|---------------------|-------|
| Telegram Bot API | grammY `bot.start()` long-polling (Node.js) | No public URL required; polling starts on server boot if token present in nexus-settings |
| whisper-cpp / openai-whisper | `execFile` cascade (same as existing `/transcribe`) | Format argument added: writes `.webm` or `.ogg` temp file based on input |
| piper TTS binary | `child_process.spawn` stdin -> stdout | Text piped to stdin; WAV or raw PCM bytes collected from stdout |

### Existing System Touch Points

| Integration Point | How Content Gen Connects | Notes |
|-------------------|--------------------------|-------|
| `assetService` + `assets` table | Every rendered output creates an asset row | `createdByAgentId` already supported; agents get credit |
| `StorageService` | All rendered blobs stored via existing `putFile()` | Use `generated/` namespace prefix; no service changes |
| `publishLiveEvent` | Job lifecycle events push to SSE stream | Extend `LIVE_EVENT_TYPES` in `packages/shared/src/constants.ts` |
| `ChatMarkdownMessage` | Inline diagram rendering; "save as asset" button | Mermaid already a UI dep; add `DiagramRenderer` wrapper |
| `companySkills` + `skill-registry` | Content types as installable skill markdown files | No schema change; skills are text files agents read as context |
| `placeholderService` | Placeholder assets tracked in PLACEHOLDERS.md manifest | Optionally extend `PlaceholderEntry` with `contentJobId` |
| `hardwareService` | Detect if Remotion/Puppeteer can run | M4 Mac Mini: arm64 Chromium available, 24GB unified memory sufficient |
| `companyId` scoping | All content jobs scoped to `companyId` | Consistent with every other resource in the system |
| Agent task sessions | Agents invoke content APIs during task execution | Use `createdByAgentId`; same pattern as `documents`, `work-products` |

### Internal Boundaries

| Boundary | Communication | Notes |
|----------|---------------|-------|
| voice route <-> voicePipelineService | Direct function call | Route is thin HTTP wrapper; all logic in service |
| telegram service <-> voicePipelineService | Direct function call | Same service used by both transports |
| telegram service <-> chatService | Direct function call | Bot calls `chatService(db)` directly, with no HTTP round-trip |
| telegram service <-> nexusSettingsService | Direct function call | Reads `voiceMode` and `telegramToken` at start and on each message |
| web UI <-> voice route | REST: `POST /api/transcribe`, `POST /api/synthesize` | Web client uses browser-side piper WASM for TTS; `/synthesize` primarily for Telegram |
| UI VoiceModeToggle <-> nexus-settings | REST: `PATCH /api/nexus-settings` | Reads/writes `voiceMode` setting |

### External Dependencies (new server deps)

| Dependency | Purpose | Platform Notes |
|------------|---------|---------------|
| `@mermaid-js/mermaid-isomorphic` | Server-side Mermaid to SVG | No Chromium needed; fast; preferred over Puppeteer for Mermaid |
| `puppeteer` | HTML to PDF | ~300MB install; bundled arm64 Chromium works on M4; only add if PDF is a phase priority |
| `remotion` (CLI) | Video/presentation render | Add as devDep in `remotion-compositions` package; CLI called via subprocess |

Mermaid client-side and Sharp are already present. No changes needed for those paths.

---
## Build Order

Based on component dependencies, the recommended build order within this milestone:

| Step | Component(s) | Reason |
|------|-------------|--------|
| 1 | `nexus-settings` schema extensions (`voiceMode`, `telegramToken`) | Everything downstream reads settings |
| 2 | `voicePipelineService` | Backs all voice. No new deps. Independently testable. |
| 3 | `voice.ts` route (`POST /transcribe`, `POST /synthesize`) | Thin wrapper. Register in `app.ts`. Move handler from chat-files. |
| 4 | `VoiceMicButton` + `WaveformDisplay` + `useSilenceDetection` | Pure UI. Depends only on `/transcribe`. |
| 5 | `VoiceModeToggle` + `useVoiceMode` | Depends on `voiceMode` in nexus-settings schema (Step 1). |
| 6 | `ChatMessage` dual output | Depends on `voiceMode` in shared `ChatMessage` type. |
| 7 | `createMessageSchema` + `ChatMessage` type (`voiceMode` flag) | Shared package change. Required by Steps 5-6. Could move earlier. |
| 8 | `telegramService` | Depends on voicePipelineService (2), chatService (existing), nexusSettings (1). |
| 9 | `telegram.ts` route + app.ts registration | Management endpoints. Needs telegramService. |
| 10 | Onboarding STT/TTS hardware detection step | Final: wires all voice detection into onboarding flow. |

Steps 4-6 can run in parallel with Steps 7-9 if split across phases.

## Scaling Considerations

This is a Mac Mini M4 single-user deployment. Analysis focuses on resource contention, not user count.

| Concern | Approach |
|---------|----------|
| Concurrent render jobs | Node.js event loop is safe for I/O. CPU-bound renders (Remotion, Puppeteer) spawn subprocesses, keeping the event loop responsive. |
| Remotion render duration | Renders can take minutes. Never synchronous HTTP. Async job pattern + SSE progress is mandatory. |
| Chromium memory (PDF/Puppeteer) | Puppeteer can use 500MB+ per render. Serialize PDF renders via an in-memory queue (one at a time). |
| Storage growth | Generated content accumulates. Add `retention_days` field to `content_jobs`; implement a cleanup cron using the existing `cron.ts` service. |
| Remotion bundle step | Bundle compositions once at server startup (or on demand). Never bundle on each render request; it takes 30-60s. |
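Serializing PDF renders through an in-memory queue, as suggested for Chromium memory, needs very little code. The sketch below is one possible shape, assuming a promise-chain queue; `SerialQueue` is a hypothetical class name, not an existing service.

```typescript
// Runs async tasks strictly one at a time, in submission order.
class SerialQueue {
  private tail: Promise<void> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    // Start this task only after every previously queued task has settled.
    const result = this.tail.then(() => task());
    // The tail swallows outcomes so one failed render does not block later ones.
    this.tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}
```

A PDF adapter could wrap each render call as `queue.run(() => renderPdf(html))`, where `renderPdf` stands for whatever function actually drives the browser. The queue lives in process memory, so queued work is lost on restart; the `content_jobs` row remains the durable record.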
---

## Build Order (Phase Dependencies)

Dependencies flow upward: infrastructure first, then content types, then UI.

```
Phase A: Core Infrastructure (unblocks everything)
  - Add content_jobs schema + migration (db package)
  - Extend LIVE_EVENT_TYPES with content.job.* (shared package)
  - Implement contentJobService (server)
  - Implement renderPipelineService stub (server)
  - Add API routes + app.ts mounts (server)

Phase B: Fast Content Types (no heavy binary deps; validates pipeline end-to-end)
  - svgGeneratorAdapter (pure TypeScript; icons, placeholders)
  - mermaidRendererAdapter (@mermaid-js/mermaid-isomorphic; no Chromium)
  - themeEngineService (pure computation)
  - UI: DiagramRenderer, ThemePreview, ContentJobViewer

Phase C: Client-Side Mermaid + Content Gallery
  - ChatMarkdownMessage extension (detect mermaid fence)
  - DiagramRenderer client-side component
  - ContentGallery page + assetService.list()
  - GeneratedAssetCard

Phase D: Document Generation (introduces Puppeteer)
  - Add puppeteer to server deps
  - pdfRendererAdapter
  - PDF download flow in UI

Phase E: Video / Presentations (introduces Remotion)
  - packages/remotion-compositions/ workspace package
  - remotionRendererAdapter (CLI subprocess)
  - Video playback in UI

Phase F: Image Generation
  - imageProcessorAdapter using Sharp (banners, OG images, social cards)
  - imageGenerationAdapter interface (Stable Diffusion / cloud APIs, future)
  - Social media content generation

Phase G: Content as Skills (no code, pure skill markdown)
  - Skill markdown files for each content type in company_skills
  - Agent-callable via existing skill system
```
## Anti-Patterns

### Anti-Pattern 1: Rendering inside chatService

**What people do:** Add Mermaid rendering to `chatService` or `documentService` because content requests arrive from chat.

**Why it's wrong:** Couples unrelated concerns. Future content types (video, PDF) would bloat chatService and block upstream rebases.

**Do this instead:** `chatService` calls `contentJobService.enqueue()`. Rendering is entirely separate. Chat is a trigger, not an owner.

### Anti-Pattern 2: Synchronous HTTP response for long renders

**What people do:** `POST /render/remotion` holds the connection open for 2+ minutes while rendering.

**Why it's wrong:** Proxies and load balancers commonly time out long-held requests (defaults of roughly 30-60s are typical), so a multi-minute render is cut off. There is no progress feedback, and clients retry blindly.

**Do this instead:** Return a `contentJobId` immediately with `202 Accepted`. Client subscribes to SSE `content.job.done` event.
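The `202 Accepted` contract from Anti-Pattern 2 can be sketched as a framework-free handler. Everything here is illustrative: `createContentJob`, `HttpResult`, and the `Enqueue` callback are hypothetical, the error code string is invented for the example, and the list of job types is copied from the `content_jobs.type` column comment.

```typescript
const CONTENT_JOB_TYPES = ["mermaid", "remotion", "pdf", "svg", "theme", "image"] as const;
type ContentJobType = (typeof CONTENT_JOB_TYPES)[number];

interface HttpResult {
  status: number;
  body: Record<string, unknown>;
}

// Hypothetical callback: persists the job, starts rendering in the background, returns the job id.
type Enqueue = (
  companyId: string,
  type: ContentJobType,
  params: Record<string, unknown>,
) => string;

function isContentJobType(value: unknown): value is ContentJobType {
  return typeof value === "string" && (CONTENT_JOB_TYPES as readonly string[]).includes(value);
}

// Validates the request and answers immediately; it never waits for the render.
function createContentJob(companyId: string, body: unknown, enqueue: Enqueue): HttpResult {
  const input = (body ?? {}) as { type?: unknown; params?: unknown };
  if (!isContentJobType(input.type)) {
    return { status: 400, body: { error: "unknown_content_type" } };
  }
  const params =
    typeof input.params === "object" && input.params !== null
      ? (input.params as Record<string, unknown>)
      : {};
  const contentJobId = enqueue(companyId, input.type, params);
  return { status: 202, body: { contentJobId, status: "queued" } };
}
```

In an Express-style route this would translate to setting the response status from `status` and sending `body` as JSON; the client then waits for the matching job event instead of holding the request open.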
### Anti-Pattern 3: One DB table per content type

**What people do:** Add separate `diagrams`, `presentations`, `themes` tables.

**Why it's wrong:** The existing `assets` table already handles typed binary blobs. The `content_jobs` table handles any render job regardless of output type. Fragmented schema multiplies migration surface.

**Do this instead:** Use `content_jobs.type` to discriminate job types. Use `assets.content_type` to discriminate output format. One jobs table, one assets table.

### Anti-Pattern 4: Bypassing StorageService for renderer output

**What people do:** Remotion adapter writes to `/tmp` and returns a filesystem path.

**Why it's wrong:** Bypasses the provider abstraction (local disk vs S3), deduplication (sha256), the `assets` table, and download URL generation.

**Do this instead:** Renderer writes output to a `Buffer`, passes to `StorageService.putFile()`, returns `objectKey`. Asset serving goes through the existing `/api` asset download route.

### Anti-Pattern 5: Modifying upstream DB tables

**What people do:** Add a `generated_content_type` column to the existing `assets` table.

**Why it's wrong:** Modifies upstream schema, causing a migration conflict on the next `git rebase upstream/master`. Violates the display-only fork constraint.

**Do this instead:** Use the `content_jobs.asset_id` FK as the signal that an asset is generated. Query `content_jobs JOIN assets` to distinguish generated from uploaded. Keep `assets` table untouched.

### Anti-Pattern 6: Remotion bundle on every render request

**What people do:** Call `bundle()` inside the render adapter on each job.

**Why it's wrong:** Bundling takes 30-60s. Renders that should take 5s take 90s.

**Do this instead:** Bundle once at server startup (or lazily on first render, cached). The `remotionRendererAdapter` calls `renderMedia()` against the pre-built bundle path.
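The "bundle once, lazily, and cache it" advice in Anti-Pattern 6 boils down to memoizing a promise. A minimal sketch follows; `memoizeAsync` is a hypothetical helper and `fakeBundle` is a stand-in that does not call Remotion, so the real bundling call would be substituted for it.

```typescript
// Caches the first in-flight or resolved promise; a failure clears the cache so a retry is possible.
function memoizeAsync<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      cached = factory().catch((err: unknown) => {
        cached = undefined;
        throw err;
      });
    }
    return cached;
  };
}

// Stand-in for the expensive bundling step (hypothetical path, no real bundler involved).
let bundleRuns = 0;
async function fakeBundle(): Promise<string> {
  bundleRuns++;
  return "/hypothetical/remotion-bundle";
}

// Every render asks for the bundle location; only the first call pays the cost.
const getBundleLocation = memoizeAsync(fakeBundle);
```

Caching the promise rather than the resolved value means concurrent first renders share a single bundling run instead of each starting their own.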
## Sources

- Direct codebase inspection: `server/src/routes/chat-files.ts` (lines 297-386), `server/src/routes/chat.ts`, `server/src/services/nexus-settings.ts`, `server/src/app.ts`, `ui/src/components/VoiceRecordButton.tsx`, `ui/src/components/TtsButton.tsx`, `ui/src/hooks/usePiperTts.ts`, `packages/shared/src/validators/chat.ts`, `packages/shared/src/types/chat.ts`
- `.planning/STATE.md`: v1.6 architectural decisions (transport-agnostic, disposable bridge, dual output, per-message flag)
- `.planning/milestones/v1.5-phases/34-voice/34-RESEARCH.md`: existing voice implementation details, WASM TTS pattern
- [grammY documentation](https://grammy.dev/): TypeScript-native, Bot API 9.6 (April 2026), long-polling vs webhooks
- [grammY deployment types guide](https://grammy.dev/guide/deployment-types): long polling recommended for single-user local; Express integration pattern
- [rhasspy/piper (archived)](https://github.com/rhasspy/piper): CLI: `echo "text" | piper --model voice.onnx -f -`; development moved to OHF-Voice/piper1-gpl Oct 2025
- grammY supports Telegram Bot API 9.6 (released April 3, 2026); latest version confirmed
- Direct inspection of `/opt/nexus` codebase (2026-04-04):
  - `server/src/services/`: factory function service patterns
  - `server/src/storage/`: `StorageService` / `StorageProvider` interfaces
  - `server/src/storage/service.ts`: `buildObjectKey()` namespace + path conventions
  - `server/src/services/live-events.ts`: SSE event bus (`publishLiveEvent`, `subscribeCompanyLiveEvents`)
  - `server/src/services/voice-pipeline.ts`: async subprocess service pattern
  - `server/src/services/placeholder-service.ts`: existing `PlaceholderEntry` manifest service
  - `server/src/services/assets.ts`: `assetService` factory (minimal; extend for listing)
  - `server/src/services/work-products.ts`: job/output separation pattern
  - `packages/db/src/schema/assets.ts`: existing `assets` table
  - `packages/db/src/schema/documents.ts`: document + revision pattern
  - `packages/shared/src/constants.ts`: `LIVE_EVENT_TYPES` (currently 9 types)
  - `server/src/app.ts`: route mounting conventions
  - `server/src/routes/voice.ts`: SSE streaming response pattern
  - `ui/package.json`: confirms `mermaid ^11.12.0` already installed
  - `server/package.json`: confirms `sharp`, `ffmpeg-static` already installed
- Mermaid v11 isomorphic: https://mermaid.js.org/config/usage.html
- Remotion CLI rendering: https://www.remotion.dev/docs/cli/render
- `@mermaid-js/mermaid-isomorphic` for server-side rendering without a browser

---

*Architecture research for: Voice Pipeline + Minimal Telegram Bridge (v1.6)*
*Researched: 2026-04-03*

*Architecture research for: Nexus v1.7 Content Generation*
*Researched: 2026-04-04*
|
|
|
|||
|
|
@ -1,30 +1,34 @@
|
|||
# Feature Research

**Domain:** Voice Pipeline (Whisper STT + Piper TTS) + Telegram Bridge (Nexus v1.6)
**Researched:** 2026-04-03
**Confidence:** MEDIUM-HIGH. STT/TTS pipeline patterns are well-documented; Telegram bot API is stable; dual-output formatting and voice mode UX patterns inferred from ChatGPT/Meta AI voice implementations and community patterns

**Domain:** Content Generation Layer (Nexus v1.7): agents produce visual, document, and media deliverables
**Researched:** 2026-04-04
**Confidence:** MEDIUM-HIGH. Technology capabilities verified via docs and ecosystem research; UX expectations inferred from comparable tools (Canva, Pitch, Mermaid Live, Figma tokens); skill system patterns based on existing Nexus skill architecture

---

## Milestone Scope

This document covers only the NEW features in v1.6. The following are already built and are dependencies, not deliverables:

- VoiceRecordButton with MediaRecorder API in ChatInput (v1.3)
- TtsButton with @mintplex-labs/piper-tts-web WASM synthesis (v1.3/v1.5)
- POST /transcribe endpoint with whisper-cpp/openai-whisper cascade (v1.3)
- VoiceStep in onboarding wizard (v1.5)
- voiceEnabled in nexus-settings (v1.5)
- Full chat system with streaming SSE (v1.3)

**New features being researched:**
- Transport-agnostic voice pipeline (server-side, not just browser WASM)
- Voice mode flag on messages (affects response formatting)
- Dual output pattern: voice-optimized prose + full markdown text
- Web chat voice UI improvements: silence detection, waveform, auto-submit
- Web chat audio playback: inline player, auto-play toggle
- Voice mode toggle setting (text only / voice input / full voice)
- Minimal Telegram bridge: single bot, text + voice relay, agent prefixing

This document covers only NEW features in v1.7. The following are already built and are dependencies, not deliverables:

- File system with upload, git versioning, PLACEHOLDERS.md manifest (v1.3)
- Skill system with Skill Aggregator and company skills API (Paperclip upstream)
- Chat interface with streaming SSE (v1.3)
- Agent orchestration, heartbeat lifecycle (Paperclip upstream)
- Voice I/O with Whisper STT + Piper TTS (v1.6)
- Hermes adapter with native skills, Ollama integration (v1.4)

**New features being researched:**
- Presentations and video generation via Remotion
- Placeholder assets with DRAFT styling and manifest tracking
- Theme and palette generator (seed color → full theme, WCAG AA, exports)
- Wallpapers and visual assets (desktop/mobile, banners, OG images)
- Diagram generation (natural language → Mermaid → SVG/PNG)
- Document generation (PDF reports, invoices, one-pagers)
- Icon generation (SVG from description, consistent sets)
- Social media content (platform-formatted posts, carousels, hashtags)
- Branding media kit (full brand identity from conversation)
- Content types as installable Nexus skills

---
|
|
@ -32,131 +36,148 @@ This document covers only the NEW features in v1.6. The following are already bu
|
|||
|
||||
### Table Stakes (Users Expect These)

Features users assume exist when voice or Telegram is mentioned. Missing these makes the feature feel broken or incomplete.

| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Silence-based auto-submit | Every voice input UI (Siri, Google, Whisper demos) stops recording on silence; holding a button feels archaic | MEDIUM | WebRTC VAD or AudioWorklet amplitude monitoring; 1.5s silence threshold typical; must show countdown so user knows what's happening |
| Waveform/amplitude visualization while recording | Users expect visual feedback that the mic is active; a static "recording..." text feels broken | LOW | Canvas or SVG with 30-50 data points; AnalyserNode from Web Audio API; real-time amplitude bars, not pre-rendered waveform |
| Voice response auto-play toggle | If the AI responded with audio, playing it automatically is expected unless the user disabled it; manual play-only feels incomplete | LOW | Boolean setting in nexus-settings (voiceAutoPlay); inline HTML5 `<audio>` element is sufficient; Web Audio API not needed |
| Markdown-free voice responses | Users who hear responses read aloud expect prose sentences, not "asterisk asterisk bold asterisk asterisk code block triple backtick" spoken aloud | MEDIUM | Requires voice mode flag on the message sent to LLM; system prompt addendum: "respond in natural spoken prose, no markdown symbols, no bullet points, no code blocks unless the user explicitly asks"; dual output requires separate LLM pass or post-processing strip |
| Telegram text relay to existing chat | Sending a text message to the Telegram bot and receiving the agent's reply is the core use case; anything less is not a bridge | MEDIUM | Telegraf (Node.js) as bot framework; message forwarded to existing chat API endpoint; response prefixed with agent name |
| Telegram voice message transcription | Telegram users frequently send voice notes; a bridge that ignores voice messages frustrates mobile users immediately | MEDIUM | Telegram sends voice as OGG/Opus; download → convert (ffmpeg) → POST /transcribe → forward text to agent → reply with text (+ optionally TTS audio back) |
| Agent identity visible in Telegram replies | When multiple agents can respond, the user must know who is replying | LOW | Simple text prefix: `[Hermes] Your answer here`; consistent format across all messages |
| Recording state visible in UI | Users must be able to tell when recording is active vs. idle vs. processing | LOW | Three states in mic button: idle (mic icon), recording (red pulsing), processing (spinner); state machine pattern |

Features that must exist for content generation to feel complete. Missing any of these and the deliverable is not production-ready.

| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Download produced file directly | Every generator (Canva, Pitch, Remotion) lets you download the output; no download = the tool is a preview, not a generator | LOW | Route: `GET /api/content/:jobId/download`; serve the file buffer with correct MIME type and `Content-Disposition: attachment` |
| Show a preview of the output | Users need to see what was generated before downloading; a blank "file ready" message is not sufficient | MEDIUM | For images/SVG: `<img>` or inline SVG; for PDF: `<iframe src="...">` or pdf.js; for Remotion: Remotion Player component; for diagrams: rendered SVG inline |
| Generation status feedback | Content generation jobs (PDF, Remotion render, Mermaid) can take 5–60s; a spinner with no progress info creates anxiety | LOW | SSE progress events or polling with status enum: `queued → generating → ready → error`; show estimated time where possible |
| Error recovery with explanation | If generation fails, the user needs to know why and what to try next | LOW | Return structured error: `{ error: "model_not_installed", message: "Ollama llava model required for image generation", suggestion: "Run: ollama pull llava" }` |
| Save output to file system | Generated files belong in the existing file system so they participate in git versioning and the PLACEHOLDERS.md manifest | MEDIUM | Integrate with existing file upload pipeline: write generated file to workspace directory, call git versioning hook, update PLACEHOLDERS.md |
| Re-generate with revised prompt | Users iterate on generated content; "generate and you're done" misses 80% of actual workflow | LOW | Store generation parameters (prompt, theme, options) with the job; "regenerate" re-calls with same params + optional overrides |
|
||||
| Content type labeled clearly | A diagram, a PDF, and a video are fundamentally different outputs; conflating them in one UI creates confusion | LOW | Each content type has a distinct icon, label, and preview strategy; use a type registry not ad hoc conditionals |
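
The status enum and structured error from the rows above can be sketched as a small state machine. This is a minimal illustration, not the project's actual job model: `GenerationJob`, `advance`, and the transition table are hypothetical names, and only the status values and the three error fields come from the table.

```typescript
// Status values taken from the table: queued -> generating -> ready -> error.
type JobStatus = "queued" | "generating" | "ready" | "error";

// Field names taken from the "Error recovery" row.
interface GenerationError {
  error: string;      // machine-readable code, e.g. "model_not_installed"
  message: string;    // human-readable explanation
  suggestion: string; // what the user can try next
}

// Hypothetical job record for illustration.
interface GenerationJob {
  id: string;
  status: JobStatus;
  failure?: GenerationError;
}

// Assumption: "ready" and "error" are terminal; regeneration creates a new job.
const allowedTransitions: Record<JobStatus, JobStatus[]> = {
  queued: ["generating", "error"],
  generating: ["ready", "error"],
  ready: [],
  error: [],
};

function advance(job: GenerationJob, next: JobStatus, failure?: GenerationError): GenerationJob {
  if (!allowedTransitions[job.status].includes(next)) {
    throw new Error(`Illegal transition ${job.status} -> ${next}`);
  }
  if (next === "error" && !failure) {
    throw new Error("An error transition needs a structured failure payload");
  }
  return { ...job, status: next, failure: next === "error" ? failure : undefined };
}
```

Keeping the transitions in one table means the SSE or polling layer only has to serialize `status`, whichever transport ends up being used.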
|
||||
|
||||
### Differentiators (Competitive Advantage)
|
||||
|
||||
Features that make v1.6's voice and Telegram features worth using, beyond baseline functionality.
|
||||
Features that distinguish Nexus content generation from Canva/Pitch/generic generators.
|
||||
|
||||
| Feature | Value Proposition | Complexity | Notes |
|
||||
|---------|-------------------|------------|-------|
|
||||
| Transport-agnostic voice pipeline | Voice processing works identically for browser input, Telegram voice notes, and future CLI/API callers; no duplication of Whisper/Piper logic | MEDIUM | Abstract to a `VoicePipelineService`: `transcribe(audioBuffer) → text`, `synthesize(text, voice?) → audioBuffer`; HTTP endpoints call the service; Telegram bot calls the same service |
|
||||
| Dual output pattern | AI responds with two representations: short spoken-prose version (for TTS/Telegram) and full markdown version (for web chat, copy-paste, code); user sees both where appropriate | HIGH | Prompt engineering: "Provide a SPOKEN response (1-3 sentences, no markdown) and a DETAILED response (full markdown). Format: SPOKEN: … DETAILED: …"; parse and split in middleware; store both in message metadata |
|
||||
| Sentence-buffered TTS streaming | Start playing the first sentence while the second is still synthesizing; reduces perceived latency vs. waiting for full response | MEDIUM | Split response on `.!?`; Piper synthesizes sentence 1, audio starts playing; meanwhile sentence 2 begins synthesis; append chunks to audio queue |
|
||||
| Voice mode flag preserves context | Messages tagged with `voice_mode: true` in the DB let the UI, Telegram bridge, and future Command Center all render correctly without re-inferring intent | LOW | Add `source` field or `voice_mode` boolean to message metadata; already-existing message schema likely supports metadata/extras column |
|
||||
| Telegram as thin relay (not a separate chat product) | The Telegram bot forwards to the existing Nexus chat engine; responses use the full agent intelligence already configured; no separate bot personality to maintain | LOW | Relay pattern: Telegram message → POST /api/workspaces/:id/chat/messages → SSE stream → collect full response → reply to Telegram; agent prefixing is presentation only |
|
||||
| Language auto-detection in STT | Whisper natively detects language without configuration; relay this info back to the UI so the user knows what language was detected | LOW | Whisper returns `language` in its JSON output; pass through to transcript response; log in message metadata; no user config needed for common languages |
|
||||
| Agent-driven generation from chat | User describes what they want in natural language; the agent selects the right content type, calls the generation skill, and delivers a file — no form-filling | HIGH | Requires agent skill routing: agent detects "make me a diagram of the user flow" → invokes `diagram-generation` skill → returns SVG file attachment in chat; skill selection uses existing Nexus Skill Aggregator |
|
||||
| Content types as installable skills | Each generator (diagrams, PDFs, wallpapers, presentations) is a separate installable skill, not a monolithic feature — users install only what they need | HIGH | Follow existing `skills/paperclip/` pattern; each skill has `SKILL.md`, optional `references/`, and is registered via company skills API; agents get assigned the relevant skills |
|
||||
| PLACEHOLDERS.md manifest integration | Every generated asset — including draft placeholders — is tracked in the existing PLACEHOLDERS.md manifest with status (draft/final), source prompt, and generator version | MEDIUM | Extend existing manifest format; add `generator`, `prompt_hash`, `generated_at` fields; draft assets get DRAFT watermark applied at generation time |
|
||||
| Seed-color-to-full-theme pipeline | User provides one hex color; system generates a complete accessible color system (primary, secondary, accent, neutral, semantic colors) with WCAG AA compliance verified | HIGH | Use OKLCH/LCh color model for perceptually uniform lightness; verify 4.5:1 contrast for text pairs; export to CSS custom properties, Tailwind config, and JSON token format |
|
||||
| WCAG AA enforced, not optional | Theme generator refuses to export palettes that fail contrast requirements — accessibility is baked in, not a checkbox | MEDIUM | Run contrast check on every text/background pair; if a combination fails, auto-adjust lightness until compliant; show contrast ratio in preview |
|
||||
| Diagram from natural language | User says "flowchart of the auth pipeline" in chat; agent generates Mermaid syntax, renders SVG, attaches to the conversation | MEDIUM | LLM generates valid Mermaid; `@mermaid-js/mermaid-cli` renders server-side; fallback: return raw Mermaid for user to copy into mermaid.live; store `.mmd` source + rendered `.svg` |
|
||||
| Local-only operation | All generation (Remotion render, PDF, Mermaid, Satori images) runs on Mac Mini M4 without cloud API calls; no data leaves the machine | MEDIUM | Reject cloud-dependent generators (no DALL-E, no Stable Diffusion API); prefer deterministic generators (Mermaid, Satori, Remotion, Playwright PDF) that need only the local LLM + local tools |
|
||||
| Branding kit from conversation | User chats about their project; agent extracts brand DNA (colors, typography, tone) and produces a coherent brand kit — no form, no design background required | HIGH | Multi-step: LLM extracts brand parameters → theme generator → typography selector → icon style picker → assembles ZIP with CSS tokens, font stack, sample SVG logo, OG image template |
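
The "parse and split in middleware" step of the dual output pattern could look roughly like the sketch below. It assumes the model emits the `SPOKEN: … DETAILED: …` format described in the table; `parseDualOutput` and the fallback behaviour are hypothetical, not taken from the codebase.

```typescript
interface DualOutput {
  spoken: string;   // short prose for TTS and Telegram
  detailed: string; // full markdown for web chat
}

function parseDualOutput(raw: string): DualOutput {
  const match = raw.match(/SPOKEN:\s*([\s\S]*?)\s*DETAILED:\s*([\s\S]*)/);
  if (!match) {
    // Fallback (an assumption): keep the reply as the detailed version and
    // derive a spoken version by stripping common markdown symbols.
    const stripped = raw.replace(/[*_`#>]/g, "").replace(/\s+/g, " ").trim();
    return { spoken: stripped, detailed: raw.trim() };
  }
  const [, spoken = "", detailed = ""] = match;
  return { spoken: spoken.trim(), detailed: detailed.trim() };
}
```

Both halves can then be stored in message metadata, so the web UI and the Telegram bridge each pick the representation they need.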
|
||||
|
||||
### Anti-Features (Commonly Requested, Often Problematic)
|
||||
|
||||
| Feature | Why Requested | Why Problematic | Alternative |
|
||||
|---------|---------------|-----------------|-------------|
|
||||
| Real-time speech-to-speech streaming | Feels like a "next level" voice experience | Requires full-duplex WebSocket audio, interrupt handling, turn-taking logic, VAD on both ends — an entirely different architecture (Pipecat, LiveKit); out of scope for a relay bridge | Sequential pipeline (speak → wait → hear) is sufficient for assistant use cases; real-time is only needed for phone-call-style interaction |
|
||||
| Per-agent Telegram bots | "My PM agent should have its own bot handle" | Multiple bots mean multiple bot tokens, multiple webhook registrations, and complex routing when agents hand off to each other; a maintenance nightmare | Single bot with agent name prefix in messages: `[PM] Here is your sprint plan`; PROJECT.md explicitly places per-agent bots out of scope |
|
||||
| Deep Telegram ↔ web chat sync | "I want to see Telegram messages in the web UI" | Real-time bidirectional sync requires a shared event bus (Postgres LISTEN/NOTIFY or Redis pub/sub), session management across transports, and conflict resolution; PROJECT.md explicitly defers this to "Postgres bus" future milestone | Relay is one-way per session: Telegram message → agent → Telegram reply; web chat is a separate session |
|
||||
| Wake word detection | "Hey Nexus, start recording" | Requires always-on microphone access, local wakeword model (Porcupine, OpenWakeWord), and careful battery/privacy handling; browser does not allow always-on mic | Mic button tap is sufficient; wake word is a future hardware device concern |
|
||||
| Streaming TTS word-by-word | Feels maximally responsive | Browser audio playback of a stream of tiny WAV fragments causes clicks, gaps, and buffering issues; each Piper call has startup overhead; the sentence-buffered approach gives 95% of the benefit | Sentence-buffered playback (buffer on `.!?`); start playing sentence 1 while sentence 2 synthesizes |
|
||||
| Inline code execution over Telegram | "I want to run tasks from Telegram" | Security: arbitrary code execution via an unauthenticated chat interface; scope: Telegram bridge is explicitly a thin relay, not a command interface | Support text and voice message relay only; task creation via conversational agent response is sufficient |
|
||||
| GSD formatting / rich elements in Telegram | "Telegram supports inline keyboards and threaded replies; use them" | Telegram's formatting model (inline keyboards, callback queries) requires stateful session tracking; PROJECT.md explicitly places this out of scope | Plain text + Markdown v1 (which Telegram natively renders for bold/italic/code); no inline keyboards in v1.6 |
|
||||
| Transcription editing before sending | "Let me see the transcript before it goes to the agent" | Adds a confirmation step that breaks the hands-free voice flow; most users trust auto-send after VAD silence detection; optionally show transcript as a message in the UI after the fact | Show the detected transcript in the chat message bubble with a small "mic" icon; no edit step |
|
||||
| Image generation via Stable Diffusion or DALL-E | "Make me an illustration of..." seems like a natural content type | SD requires GPU VRAM (conflicts with LLM VRAM budget on M4); DALL-E is cloud, data leaves machine; output quality is non-deterministic and hard to brand-consistently | Deterministic vector tools: Satori for OG images/banners, icon description → SVG path via LLM (text-to-SVG), wallpaper = CSS gradient/pattern composition; no raster AI images in v1.7 |
|
||||
| Real-time collaborative editing of generated content | "Let the agent iterate with me live" | Requires a full rich-text or canvas editor (collaborative editing is a product in itself); far outside scope | Chat-and-regenerate loop: show output, accept feedback as text, regenerate — no in-place editing |
|
||||
| Font embedding in all output formats | "I want my brand font in the PDF and video" | Font licensing for system-level embed is complex; font subsetting in PDF requires careful handling; Remotion font loading has SSR implications | Use system-safe font stacks for PDF (Helvetica, Times, Courier are embed-safe); Remotion uses `@remotion/google-fonts` for web-safe options; custom font is a v2 concern |
|
||||
| Batch generation (50 social posts at once) | "Generate a month of content in one click" | Job queue depth, disk space, and UI feedback for 50 concurrent generation jobs is significant infrastructure work | Single-at-a-time generation with a "generate next variant" button; queue infrastructure is a v2 concern |
|
||||
| Auto-publish to social platforms | "Post directly to Twitter/LinkedIn" | OAuth token management per platform, platform API rate limits, legal liability for AI-generated content posted as the user | Download + manual post; provide platform-formatted file with exact recommended dimensions; no publishing API integration in v1.7 |
|
||||
| Template marketplace / sharing | "Share my Remotion template with others" | Multi-user/multi-workspace concerns; the Nexus model is single-workspace, single-user | Templates stored in workspace file system under `templates/`; user can git-push to share; no marketplace infrastructure |
|
||||
| Animated / lottie social content | "Animated post for Instagram stories" | Lottie export from Remotion is possible but adds significant complexity; Instagram animated format requirements are strict | Static images for social in v1.7; Remotion video export covers the animation use case separately |
|
||||
| AI logo design (raster output) | "Generate my company logo" | AI raster logos are non-scalable and inconsistent across regenerations; brand identity requires reproducibility | SVG icon generation from description using LLM-as-code (the LLM writes SVG path code); deterministic, scalable, reproducible |
|
||||
|
||||
---
|
||||
|
||||
## Feature Dependencies
|
||||
|
||||
```
|
||||
Transport-Agnostic VoicePipelineService
|
||||
└──wraps──> Existing /transcribe endpoint (Whisper) [already built]
|
||||
└──wraps──> Piper TTS binary/WASM [already built in browser; server-side is new]
|
||||
└──consumed-by──> Web chat mic button (browser calls server or uses WASM directly)
|
||||
└──consumed-by──> Telegram bridge (server-side calls VoicePipelineService)
|
||||
└──consumed-by──> Future transports (CLI, API, Command Center)
|
||||
Content Skill System (foundation)
|
||||
└──required-by──> All content types (diagrams, PDFs, presentations, themes, icons)
|
||||
└──requires──> Existing Skill Aggregator + company skills API [already built]
|
||||
└──requires──> Agent skill routing (chat → skill invocation → file attachment)
|
||||
|
||||
Voice Mode Flag
|
||||
└──set-by──> Web chat (user is in voice mode)
|
||||
└──set-by──> Telegram bridge (message arrived as voice note)
|
||||
└──consumed-by──> LLM prompt construction (appends no-markdown instruction)
|
||||
└──consumed-by──> Dual output pattern (triggers two-response format)
|
||||
└──consumed-by──> TTS synthesis (triggers auto-synthesis of response)
|
||||
Diagram Generation
|
||||
└──requires──> @mermaid-js/mermaid-cli (server-side render to SVG/PNG)
|
||||
└──requires──> LLM to generate valid Mermaid syntax
|
||||
└──produces──> SVG + raw .mmd source → saved to file system
|
||||
|
||||
Dual Output Pattern
|
||||
└──requires──> Voice mode flag (only triggers in voice mode)
|
||||
└──requires──> LLM prompt engineering (structured SPOKEN/DETAILED format)
|
||||
└──produces──> Short prose (for TTS, Telegram reply)
|
||||
└──produces──> Full markdown (for web chat display, copy)
|
||||
PDF Generation
|
||||
└──requires──> Playwright (headless Chromium) OR pdf-lib (programmatic)
|
||||
└──choice: Playwright for HTML-template-based PDFs (reports, one-pagers)
|
||||
└──choice: pdf-lib for programmatic PDFs (invoices, receipts with data)
|
||||
└──produces──> .pdf file → saved to file system
|
||||
|
||||
Web Chat Voice UI (silence detection + waveform)
|
||||
└──requires──> Existing VoiceRecordButton [already built — enhance, not replace]
|
||||
└──requires──> Web Audio API (AnalyserNode for amplitude) [browser built-in]
|
||||
└──enhances──> Voice Mode Toggle (waveform only visible when voice mode active)
|
||||
Theme Generator
|
||||
└──requires──> Color math library (chroma-js or culori for OKLCH)
|
||||
└──requires──> WCAG contrast calculation (wcag-color-contrast or a manual WCAG 2.x relative-luminance check)
|
||||
└──produces──> CSS custom properties file + JSON tokens + Tailwind config
|
||||
└──consumed-by──> Branding media kit (uses theme as input)
|
||||
|
||||
Web Chat Audio Playback
|
||||
└──requires──> TTS synthesis output (WAV/MP3 audio buffer)
|
||||
└──requires──> Voice mode flag (auto-play only in full voice mode)
|
||||
└──independent──> waveform visualization (different UI component)
|
||||
Remotion Presentations + Video
|
||||
└──requires──> @remotion/renderer (server-side render to MP4/still frames)
|
||||
└──requires──> Node.js >=18 (already met)
|
||||
└──requires──> ffmpeg-static (already in stack from v1.6 for audio; reused for video)
|
||||
└──produces──> .mp4 or .png stills → saved to file system
|
||||
└──optionally-uses──> Theme generator (colors, typography from brand kit)
|
||||
|
||||
Telegram Bridge
|
||||
└──requires──> VoicePipelineService (for voice note handling)
|
||||
└──requires──> Existing chat API (POST /api/... for message relay)
|
||||
└──requires──> ffmpeg (OGG/Opus → WAV conversion for Whisper)
|
||||
└──requires──> Telegraf (Node.js bot framework)
|
||||
└──independent──> web chat UI changes
|
||||
Wallpapers + Visual Assets (OG images, banners, social headers)
|
||||
└──requires──> Satori (HTML/CSS → SVG) + Sharp (SVG → PNG, resize)
|
||||
└──requires──> Platform dimension registry (OG: 1200×630, LinkedIn: 1584×396, etc.)
|
||||
└──produces──> PNG files at multiple sizes → saved to file system
|
||||
└──optionally-uses──> Theme generator (brand colors)
|
||||
|
||||
Onboarding STT/TTS Detection
|
||||
└──requires──> Existing VoiceStep [already built — update, not replace]
|
||||
└──requires──> VoicePipelineService availability check
|
||||
└──independent──> Telegram bridge
|
||||
Icon Generation (SVG)
|
||||
└──requires──> LLM to generate SVG path code from description
|
||||
└──no external rendering lib needed (SVG is text)
|
||||
└──produces──> .svg files → saved to file system
|
||||
└──consumed-by──> Branding media kit
|
||||
|
||||
Social Media Content
|
||||
└──requires──> Wallpaper/banner generator (for image posts)
|
||||
└──requires──> LLM (for copy: captions, hashtags, platform-appropriate tone)
|
||||
└──requires──> Platform spec registry (image sizes, character limits per platform)
|
||||
└──produces──> Platform folder: {platform}/{size}.png + caption.txt + hashtags.txt
|
||||
|
||||
Branding Media Kit
|
||||
└──requires──> Theme generator (colors)
|
||||
└──requires──> Icon generator (SVG logo concept)
|
||||
└──requires──> Wallpaper generator (OG image, banner)
|
||||
└──requires──> LLM (typography pairing, brand voice, tagline)
|
||||
└──produces──> ZIP archive: brand-kit.zip containing all assets + CSS tokens
|
||||
|
||||
Placeholder Asset System
|
||||
└──requires──> File system with PLACEHOLDERS.md [already built]
|
||||
└──requires──> Any generator (diagram, wallpaper, PDF) to set draft flag
|
||||
└──produces──> Asset file with DRAFT watermark + PLACEHOLDERS.md entry
|
||||
└──resolves-via──> "generate final" command removes watermark, updates manifest
|
||||
```
|
||||
|
||||
### Dependency Notes
|
||||
|
||||
- **VoicePipelineService is the keystone:** Build this first. It abstracts Whisper + Piper behind a clean interface. Every other v1.6 feature is a consumer. If this is skipped, the Telegram bridge and web improvements become duplicate, divergent code.
|
||||
- **Voice mode flag must be stored on the message:** Not just passed in memory. Future Command Center and Telegram both need to know retroactively whether a message was voice-originated.
|
||||
- **Dual output is optional on non-voice messages:** Text-mode messages do not need the SPOKEN variant. The prompt injection and response parsing only apply when `voice_mode: true`.
|
||||
- **Telegram bridge has no UI:** It's a server-side Node.js process (or Express route). No React changes needed for Telegram.
|
||||
- **ffmpeg is a hard dependency for Telegram voice notes:** Telegram sends OGG/Opus; Whisper expects WAV/MP3. ffmpeg must be available on the server. On Mac Mini this is `brew install ffmpeg`.
|
||||
- **Web chat waveform enhances existing VoiceRecordButton:** Do not replace it. The existing component handles MediaRecorder and send; add AudioWorklet/AnalyserNode visualization on top.
|
||||
- **Content Skill System is the foundation.** Every content type is a skill. If the skill routing pattern is not established first, each content type becomes a disconnected one-off endpoint.
|
||||
- **Satori + Sharp is the image stack for all 2D raster outputs.** Do not introduce a separate image library per content type — Satori handles the JSX/CSS layout, Sharp handles PNG conversion and resizing. One pipeline for wallpapers, OG images, social headers, and banner generation.
|
||||
- **Playwright for HTML-template PDFs, pdf-lib for data-driven PDFs.** Do not use a single library for both — Playwright is better for design-rich output, pdf-lib is better for invoices and receipts. Use the right tool per use case.
|
||||
- **ffmpeg-static already in v1.6 stack.** Remotion's video pipeline reuses it — do not add a second FFmpeg dependency.
|
||||
- **Branding media kit is a composition of other skills.** It is not a standalone generator; it orchestrates theme → icons → wallpapers → copy and zips the outputs.
|
||||
- **PLACEHOLDERS.md integration is cross-cutting.** Every content generator must write to the manifest on save; this is not optional per the v1.7 milestone requirements.
|
||||
|
||||
---
|
||||
|
||||
## MVP Definition
|
||||
|
||||
### Launch With (v1.6 Milestone)
|
||||
### Launch With (v1.7 Milestone — P1)
|
||||
|
||||
Minimum viable set to make voice and Telegram genuinely useful, not just technically present.
|
||||
The minimum set to make content generation genuinely useful as a daily workflow tool.
|
||||
|
||||
- [ ] **VoicePipelineService** — Transport-agnostic server-side Whisper + Piper abstraction. Why essential: gates all other features; prevents code duplication between web and Telegram.
|
||||
- [ ] **Voice mode flag + dual output** — LLM receives no-markdown instruction; response splits into spoken prose + full markdown. Why essential: spoken markdown sounds broken; this is what makes TTS usable.
|
||||
- [ ] **Web chat silence detection + auto-submit** — Amplitude-based VAD stops recording automatically and submits. Why essential: hands-free voice only works if the user does not have to click "send."
|
||||
- [ ] **Web chat waveform visualization** — Amplitude bars while recording. Why essential: without it, users cannot tell if the mic is picking up audio.
|
||||
- [ ] **Web chat audio playback with auto-play toggle** — Agent voice responses play inline. Why essential: without playback, TTS synthesis has nowhere to go.
|
||||
- [ ] **Voice mode toggle setting** — Three modes: text only / voice input only / full voice (input + output). Why essential: users need to control the modality per session.
|
||||
- [ ] **Telegram text relay** — Text messages in → agent response out, with agent prefix. Why essential: core use case for phone access.
|
||||
- [ ] **Telegram voice note relay** — Voice notes in → transcribe → agent → text reply. Why essential: mobile Telegram users default to voice notes.
|
||||
- [ ] **Content skill system scaffolding** — Skill registration pattern, agent routing, file attachment to chat; gates everything else
|
||||
- [ ] **Diagram generation** — NL → Mermaid → SVG/PNG; most requested, lowest complexity, immediate productivity value for a developer
|
||||
- [ ] **Theme and palette generator** — Seed color → full color system with WCAG AA; exports CSS tokens + JSON; standalone value even without other generators
|
||||
- [ ] **Placeholder asset system** — DRAFT watermark on any generated file + PLACEHOLDERS.md entry; prevents generated assets from being accidentally shipped unreviewed
|
||||
- [ ] **PDF generation** — Playwright-based HTML → PDF for reports/one-pagers; pdf-lib for invoices; solves a concrete recurring task
|
||||
- [ ] **Wallpapers and OG images** — Satori + Sharp pipeline; produces desktop wallpaper, OG image, and LinkedIn/Twitter header from a single theme config
|
||||
|
||||
### Add After Validation (v1.6.x)
|
||||
### Add After Validation (v1.7.x — P2)
|
||||
|
||||
- [ ] **Telegram TTS reply option** — Agent response synthesized and sent back as an OGG voice note. Trigger: user feedback that text replies are too long to read on phone.
|
||||
- [ ] **Sentence-buffered TTS streaming** — Start audio playback before full synthesis completes. Trigger: latency complaints with longer responses.
|
||||
- [ ] **Voice response history in UI** — Chat messages show audio player for past synthesized responses (not just the current one). Trigger: users want to replay previous responses.
|
||||
- [ ] **Icon generation (SVG)** — LLM-as-SVG-coder; trigger: user asks for consistent icon set for a project
|
||||
- [ ] **Social media content** — Platform-formatted posts + captions; trigger: user has a completed project and needs to announce it
|
||||
- [ ] **Remotion presentations** — React-component slides → MP4/stills; trigger: user needs a pitch deck or demo video; requires careful VRAM budget on M4
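
For the sentence-buffered TTS item in the list above, the splitting step might look like this minimal sketch. It assumes that a terminator (`.`, `!`, `?`) only ends a sentence when followed by whitespace or the end of the text, so that version numbers such as `1.5` stay intact; `splitSentences` is a hypothetical helper, and abbreviations like "e.g." would still need special handling.

```typescript
function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    const isTerminator = ".!?".includes(text.charAt(i));
    // A terminator only counts at the end of the text or before whitespace.
    const atBoundary = i + 1 === text.length || /\s/.test(text.charAt(i + 1));
    if (isTerminator && atBoundary) {
      sentences.push(text.slice(start, i + 1).trim());
      start = i + 1;
    }
  }
  // Keep any trailing text that has no terminator.
  const tail = text.slice(start).trim();
  if (tail) sentences.push(tail);
  return sentences;
}
```

Each returned sentence can be handed to the synthesizer in order, so playback of the first chunk can begin while later chunks are still being synthesized.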
|
||||
|
||||
### Future Consideration (v2+)
|
||||
### Future Consideration (v2+ — P3)
|
||||
|
||||
- [ ] **Real-time speech-to-speech** — Full-duplex conversation; requires Pipecat or LiveKit; entirely different architecture.
|
||||
- [ ] **Wake word detection** — Always-on mic, local wakeword model; hardware device concern.
|
||||
- [ ] **Deep Telegram ↔ web sync** — Bidirectional session mirroring via Postgres bus; deferred per PROJECT.md.
|
||||
- [ ] **Per-transport voice models** — Different Piper voice for Telegram vs. web (e.g., cleaner phone voice vs. natural assistant voice).
|
||||
- [ ] **Branding media kit** — Full brand kit ZIP; requires all other generators to be stable first; high coordination cost
|
||||
- [ ] **Batch generation** — Multiple variants or sizes at once; requires job queue infrastructure
|
||||
- [ ] **Template library** — Reusable Remotion/Satori templates stored in workspace
|
||||
- [ ] **Font embedding** — Custom font in PDF and video; requires font licensing audit and subsetting
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -164,174 +185,199 @@ Minimum viable set to make voice and Telegram genuinely useful, not just technic
|
|||
|
||||
| Feature | User Value | Implementation Cost | Priority |
|
||||
|---------|------------|---------------------|----------|
|
||||
| VoicePipelineService | HIGH | MEDIUM | P1 |
|
||||
| Voice mode flag + dual output | HIGH | MEDIUM | P1 |
|
||||
| Silence detection + auto-submit | HIGH | MEDIUM | P1 |
|
||||
| Waveform visualization | MEDIUM | LOW | P1 |
|
||||
| Audio playback + auto-play toggle | HIGH | LOW | P1 |
|
||||
| Voice mode toggle setting | HIGH | LOW | P1 |
|
||||
| Telegram text relay | HIGH | MEDIUM | P1 |
|
||||
| Telegram voice note relay | HIGH | MEDIUM | P1 |
|
||||
| Telegram TTS reply | MEDIUM | MEDIUM | P2 |
|
||||
| Sentence-buffered TTS streaming | MEDIUM | MEDIUM | P2 |
|
||||
| Voice response history | LOW | MEDIUM | P3 |
|
||||
| Real-time speech-to-speech | HIGH | HIGH | P3 (v2+) |
|
||||
| Content skill system | HIGH | MEDIUM | P1 |
|
||||
| Diagram generation | HIGH | LOW | P1 |
|
||||
| Theme + palette generator | HIGH | MEDIUM | P1 |
|
||||
| Placeholder asset system | MEDIUM | LOW | P1 |
|
||||
| PDF generation | HIGH | MEDIUM | P1 |
|
||||
| Wallpapers + OG images (Satori) | MEDIUM | MEDIUM | P1 |
|
||||
| Icon generation (SVG) | MEDIUM | LOW | P2 |
|
||||
| Social media content | MEDIUM | MEDIUM | P2 |
|
||||
| Remotion presentations + video | HIGH | HIGH | P2 |
|
||||
| Branding media kit | HIGH | HIGH | P3 |
|
||||
| Batch generation | LOW | HIGH | P3 |
|
||||
| Template library | MEDIUM | MEDIUM | P3 |
|
||||
|
||||
**Priority key:**
|
||||
- P1: Must have for v1.6 launch
|
||||
- P2: Should have, add in v1.6.x
|
||||
- P3: Nice to have, v2+
|
||||
- P1: Must have for v1.7 launch
|
||||
- P2: Should have; add when P1 is stable
|
||||
- P3: Future milestone
|
||||
|
||||
---
|
||||
|
||||
## Content Type Profiles
|
||||
|
||||
Detailed breakdown of what each content type requires and delivers.
|
||||
|
||||
### Diagram Generation
|
||||
|
||||
**User trigger:** "Draw me a sequence diagram of the auth flow" in chat
|
||||
**Input:** Natural language description
|
||||
**Output:** `.svg` + `.mmd` (Mermaid source) files
|
||||
**Generator:** LLM → Mermaid syntax → `@mermaid-js/mermaid-cli` `run()` API
|
||||
**Preview:** Inline SVG in chat bubble
|
||||
**Complexity:** LOW — Mermaid CLI has Node.js programmatic API; LLMs are good at generating valid Mermaid
|
||||
**Risk:** LLM occasionally produces invalid Mermaid syntax; must validate and retry or surface the raw `.mmd` for user to fix in mermaid.live
|
||||
**Platform spec:** Vector SVG = no size constraint; PNG export at 2x for retina via `--scale 2`
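
The validate-and-retry behaviour named under **Risk** can be sketched independently of the renderer. In the sketch below, `generate` stands for the LLM call and `render` for whatever wraps the Mermaid CLI; both are injected and hypothetical, as are `generateDiagram` and the default of three attempts.

```typescript
// Hypothetical: asks the LLM for Mermaid source, optionally passing the last render error.
type GenerateMermaid = (description: string, previousError?: string) => Promise<string>;
// Hypothetical: renders Mermaid source to SVG and rejects on invalid syntax.
type RenderSvg = (mermaidSource: string) => Promise<string>;

type DiagramResult =
  | { ok: true; source: string; svg: string }
  | { ok: false; source: string; error: string };

async function generateDiagram(
  description: string,
  generate: GenerateMermaid,
  render: RenderSvg,
  maxAttempts = 3,
): Promise<DiagramResult> {
  let source = "";
  let lastError: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    source = await generate(description, lastError);
    try {
      const svg = await render(source);
      return { ok: true, source, svg };
    } catch (err) {
      // Feed the render error back into the next generation attempt.
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  // Out of attempts: surface the raw .mmd source so the user can fix it by hand.
  return { ok: false, source, error: lastError ?? "unknown render error" };
}
```

Returning the source in both branches matches the stated output of `.svg` plus `.mmd`, and gives the failure path something useful to show.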
|
||||
|
||||
### Theme + Palette Generator
|
||||
|
||||
**User trigger:** "Generate a color theme from #2563EB" or "Create a dark theme for my portfolio"
|
||||
**Input:** Seed hex color + optional: mode (light/dark/both), style (minimal/vibrant/muted)
|
||||
**Output:** `theme.css` (CSS custom properties), `theme.json` (design tokens), `tailwind.config.ts`
|
||||
**Generator:** Server-side color math using OKLCH/LCh model; no LLM required for color generation; LLM assists with labeling semantic colors (primary, danger, success)
|
||||
**Preview:** Live color swatches with contrast ratio overlay in UI
|
||||
**Complexity:** MEDIUM — OKLCH color math is non-trivial; WCAG AA enforcement requires iteration loop
|
||||
**WCAG AA rule:** Every text/background combination must hit 4.5:1 contrast ratio; auto-adjust lightness until compliant
|
||||
**Exports:** Three formats (CSS, JSON, Tailwind) from same data model
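
The WCAG AA rule above rests on the WCAG 2.x contrast-ratio formula, which is short enough to sketch directly. The luminance and ratio functions follow the WCAG 2.x definition; `ensureContrast` is a hypothetical helper that nudges the text color toward black or white in sRGB, a deliberate simplification of the OKLCH lightness adjustment the profile describes.

```typescript
type Rgb = [number, number, number];

function parseHex(hex: string): Rgb {
  if (!/^#?[0-9a-f]{6}$/i.test(hex)) throw new Error(`Expected a 6-digit hex color, got "${hex}"`);
  const n = parseInt(hex.replace("#", ""), 16);
  return [(n >> 16) & 255, (n >> 8) & 255, n & 255];
}

// WCAG 2.x relative luminance of an sRGB color (threshold 0.03928 as published).
function relativeLuminance([r, g, b]: Rgb): number {
  const lin = (c: number): number => {
    const s = c / 255;
    return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b);
}

function contrastRatio(a: Rgb, b: Rgb): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// Simplified adjustment loop: step the text color toward whichever of black or
// white contrasts more with the background, until the target ratio is met.
function ensureContrast(text: Rgb, background: Rgb, target = 4.5): Rgb {
  const black: Rgb = [0, 0, 0];
  const white: Rgb = [255, 255, 255];
  const anchor = contrastRatio(black, background) >= contrastRatio(white, background) ? black : white;
  const mix = (from: number, to: number, t: number): number => Math.round(from + (to - from) * t);
  for (let step = 0; step <= 20; step++) {
    const t = step / 20;
    const candidate: Rgb = [mix(text[0], anchor[0], t), mix(text[1], anchor[1], t), mix(text[2], anchor[2], t)];
    if (contrastRatio(candidate, background) >= target) return candidate;
  }
  return anchor; // best achievable if the target exceeds what black or white can reach
}
```

For the 4.5:1 target the loop always finds a compliant color, because black or white reaches at least roughly 4.58:1 against any background.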
|
||||
|
||||
### Placeholder Asset System
|
||||
|
||||
**User trigger:** Any content generator can emit a "draft" asset, or the user can explicitly mark an asset as a draft placeholder
|
||||
**Input:** Any generated file + optional placeholder label
|
||||
**Output:** File with diagonal DRAFT watermark (SVG overlay for images/PDFs, badge for other types) + PLACEHOLDERS.md entry
|
||||
**Generator:** Post-processing step in every content pipeline, not a standalone generator
|
||||
**Manifest fields:** `path`, `type`, `status` (draft/final), `generator`, `prompt_hash`, `generated_at`, `resolved_at`
|
||||
**Complexity:** LOW — SVG watermark overlay is a compositing operation; PLACEHOLDERS.md is an existing format to extend
|
||||
**Resolve flow:** "finalize this asset" removes watermark, updates manifest status to `final`
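
The manifest fields and resolve flow above might map onto a record like the one sketched here. The field names come from the **Manifest fields** line; `createDraftEntry`, `finalizeEntry`, the ISO 8601 timestamp format, and the use of `null` for an unresolved entry are assumptions. Watermark removal and the actual PLACEHOLDERS.md serialization are left out.

```typescript
interface PlaceholderEntry {
  path: string;
  type: string;
  status: "draft" | "final";
  generator: string;
  prompt_hash: string;
  generated_at: string;       // ISO 8601 (assumed format)
  resolved_at: string | null; // null while the asset is still a draft (assumed)
}

function createDraftEntry(
  path: string,
  type: string,
  generator: string,
  promptHash: string,
  now: Date = new Date(),
): PlaceholderEntry {
  return {
    path,
    type,
    status: "draft",
    generator,
    prompt_hash: promptHash,
    generated_at: now.toISOString(),
    resolved_at: null,
  };
}

// "Finalize this asset": flips the status and records when it was resolved.
function finalizeEntry(entry: PlaceholderEntry, now: Date = new Date()): PlaceholderEntry {
  if (entry.status === "final") return entry; // already resolved, keep the original timestamp
  return { ...entry, status: "final", resolved_at: now.toISOString() };
}
```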
|
||||
|
||||
### PDF Generation
|
||||
|
||||
**Use case A — Design-rich reports and one-pagers:** HTML template rendered with Playwright headless Chromium → PDF
|
||||
- Supports full CSS (flexbox, grid, custom fonts via @font-face)
|
||||
- Startup cost: 1–3s browser init; reuse browser instance across requests
|
||||
- Output: pixel-accurate PDF matching HTML preview
|
||||
|
||||
**Use case B — Invoices, receipts, data tables:** programmatic construction with `pdf-lib`
|
||||
- No browser dependency; pure Node.js; fast (<200ms for simple documents)
|
||||
- Supports: text positioning, tables, page breaks, image embedding
|
||||
- Output: structured PDF from data objects
|
||||
|
||||
**User trigger:** "Generate a project summary PDF" (→ Playwright) or "Create an invoice for client X" (→ pdf-lib)
|
||||
**Complexity:** MEDIUM (Playwright browser lifecycle management; pdf-lib API for data-driven docs)
|
||||
|
||||
### Wallpapers + Visual Assets
|
||||
|
||||
**Scope:** Desktop wallpaper (2560×1440, 3840×2160), mobile wallpaper (1080×1920), OG image (1200×630), LinkedIn banner (1584×396), Twitter/X header (1500×500)
|
||||
**User trigger:** "Generate a wallpaper for my project" or "Create an OG image with the project name"
|
||||
**Input:** Theme colors + project name/tagline + optional layout style
|
||||
**Output:** PNG files at each requested size
|
||||
**Generator:** Satori (JSX → SVG) → Sharp (SVG → PNG, resize per platform spec)
|
||||
**Preview:** Thumbnail grid in UI; click to view full size; download individual or as ZIP
|
||||
**Complexity:** MEDIUM — Satori supports only a subset of CSS (`display: grid` is not available in all versions, so use flexbox for layout); Sharp handles the raster conversion
|
||||
|
||||
### Icon Generation (SVG)
|
||||
|
||||
**User trigger:** "Create an icon for notifications" or "Generate a 5-icon set for the nav bar"
|
||||
**Input:** Icon description + optional style (outline/filled/duotone) + size (24px/32px/48px)
|
||||
**Output:** `.svg` file(s) with clean `<svg>` structure
|
||||
**Generator:** LLM generates SVG path code directly — this is a text-to-code task, not image generation
|
||||
**Preview:** Rendered SVG inline in chat; displayed at 24, 48, 96px to show scalability
|
||||
**Complexity:** LOW — modern LLMs (Claude, GPT-4) reliably generate clean SVG paths for simple icons; more complex icons need iteration
|
||||
**Consistency rule:** Generate entire sets in one LLM call with style instructions; icons generated separately look inconsistent
|
||||
|
||||
### Social Media Content
|
||||
|
||||
**Scope:** Post image + caption + hashtags for LinkedIn, Twitter/X, Instagram (static only)
|
||||
**Platform specs:**
|
||||
- LinkedIn: 1200×628px image, 3000 char limit, 3–5 hashtags
|
||||
- Twitter/X: 1200×675px image, 280 char limit (with image), 2–3 hashtags
|
||||
- Instagram: 1080×1080px (square), 2200 char limit, 10–30 hashtags
|
||||
**User trigger:** "Create a launch announcement post for LinkedIn and Twitter"
|
||||
**Input:** Project description/milestone + platform selection + tone (professional/casual/technical)
|
||||
**Output:** Per platform: `{platform}/image.png` + `{platform}/caption.txt` + `{platform}/hashtags.txt`
|
||||
**Generator:** Satori for platform image → LLM for caption + hashtags
|
||||
**Complexity:** MEDIUM — platform spec registry is straightforward; caption writing via LLM is reliable; image must be platform-safe (no text too close to edge)
|
||||
|
||||
### Remotion Presentations + Video
|
||||
|
||||
**User trigger:** "Create a 2-minute pitch deck video" or "Generate slides for the project demo"
|
||||
**Input:** Slide content (title, bullets, code snippets) + theme + duration estimate
|
||||
**Output:** `.mp4` (for video) or PNG stills per slide (for presentation mode)
|
||||
**Generator:** `@remotion/renderer` `renderMedia()` API — server-side, no browser UI needed
|
||||
**Preview:** Remotion Player component in UI for interactive playback before export
|
||||
**Complexity:** HIGH — Remotion rendering is CPU-intensive (it runs on the CPU, so no GPU is needed on the M4); rendering a 2-minute video takes roughly 30–90s on M4, so a render queue is required
|
||||
**VRAM note:** Remotion does NOT use GPU/VRAM; pure CPU/RAM render; does not compete with LLM VRAM budget
|
||||
**ffmpeg reuse:** Remotion uses ffmpeg internally for video encoding; `ffmpeg-static` already in the v1.6 stack satisfies this
|
||||
|
||||
### Branding Media Kit (v2 — complex coordination)
|
||||
|
||||
**Output:** `brand-kit.zip` containing: `colors/theme.css`, `colors/theme.json`, `icons/logo.svg`, `icons/favicon.svg`, `images/og-image.png`, `images/banner-linkedin.png`, `images/banner-twitter.png`, `images/wallpaper-desktop.png`, `typography/font-stack.css`, `copy/brand-voice.md`, `copy/tagline.txt`
|
||||
**Generator:** Orchestrator agent coordinates all sub-generators in sequence
|
||||
**Complexity:** HIGH — coordination of 6 generators with shared state (theme colors must flow through all visual assets)
|
||||
|
||||
---
|
||||
|
||||
## Competitor Feature Analysis
|
||||
|
||||
| Feature | ChatGPT Voice Mode | Telegram + other bots | Nexus v1.6 Approach |
|---------|--------------------|-----------------------|---------------------|
| STT | Whisper (cloud) | Per-bot (usually cloud) | Whisper local, CPU fallback |
| TTS | Custom neural (cloud) | gTTS or ElevenLabs | Piper local, CPU-only |
| Markdown-free voice | Yes (GPT strips markdown) | Usually not (bots send raw markdown) | Dual output: SPOKEN + DETAILED |
| Silence detection | Yes (VAD, full-duplex) | N/A | Amplitude VAD, 1.5s threshold |
| Waveform UI | Animated blobs (not literal waveform) | N/A | AnalyserNode amplitude bars |
| Agent identity in replies | N/A (single assistant) | Custom per bot | Text prefix `[AgentName]` |
| Telegram voice note support | N/A | Varies widely | OGG→WAV→Whisper→agent |
| Offline / local operation | No | No | Fully local: Whisper + Piper + Ollama |
| Transport abstraction | N/A | N/A | VoicePipelineService (web + Telegram share same service) |

| Feature | Canva / Pitch | Mermaid Live / Eraser | Figma Tokens Studio | Nexus v1.7 Approach |
|---------|--------------|----------------------|---------------------|---------------------|
| Content type | Raster images, slides | Diagrams only | Design tokens only | All types via skills |
| AI integration | Prompt-to-design (cloud) | None / limited | None | Chat-driven, local LLM |
| Offline / local | No | No | No | Fully local on M4 |
| Skill installability | Monolithic product | Standalone tool | Figma plugin | Per-type installable skills |
| File ownership | Cloud-locked | Export only | Figma-locked | Local file system, git-versioned |
| WCAG enforcement | Optional check | N/A | Via plugin | Enforced at generation |
| PLACEHOLDERS.md | N/A | N/A | N/A | Native; draft tracking built in |
| Agent-driven | No | No | No | Core UX: chat → deliverable |
|
||||
|
||||
---
|
||||
|
||||
## Voice Pipeline Architecture Notes
|
||||
## Platform Dimension Registry
|
||||
|
||||
**Confidence:** HIGH for the cascading/sequential pipeline; MEDIUM for dual output prompt engineering reliability.
|
||||
Used by wallpaper generator, social content, and branding kit.
|
||||
|
||||
### Sequential Pipeline (chosen architecture for v1.6)
|
||||
|
||||
```
[Browser/Telegram]
    |
    | audio buffer (WAV/OGG)
    v
VoicePipelineService.transcribe()
    |
    | transcript text + language + confidence
    v
LLM (with voice_mode prompt addendum)
    |
    | structured response: SPOKEN: "..." DETAILED: "..."
    v
Response parser → { spoken: string, detailed: string }
    |                    |
    |                    v
    |            Web chat: render detailed (markdown)
    |            Telegram: send spoken as text
    v
VoicePipelineService.synthesize(spoken)
    |
    | WAV audio buffer
    v
Web chat: <audio> element autoplay
Telegram (v2): sendVoice() as OGG/Opus
```
|
||||
|
||||
### Why not real-time speech-to-speech:
|
||||
|
||||
Real-time requires full-duplex WebSocket audio, interrupt detection (barge-in), turn-taking state machine, and sub-200ms latency budgets. The sequential pattern targets <3s end-to-end on Apple Silicon M4, which is appropriate for assistant interactions (not phone calls). The complexity delta is enormous; PROJECT.md explicitly defers this.
|
||||
| Asset Type | Width | Height | Notes |
|------------|-------|--------|-------|
| OG Image | 1200 | 630 | Universal (Facebook, LinkedIn, Twitter) |
| LinkedIn Banner | 1584 | 396 | Center-safe zone; edges cropped on mobile |
| Twitter/X Header | 1500 | 500 | 3:1 aspect ratio |
| YouTube Banner | 2560 | 1440 | Safe zone: center 1546×423 |
| Instagram Square | 1080 | 1080 | 1:1 |
| Desktop Wallpaper | 2560 | 1440 | Standard; also offer 3840×2160 |
| Mobile Wallpaper | 1080 | 1920 | 9:16 |
| Favicon | 32 | 32 | SVG preferred; PNG fallback |
| Apple Touch Icon | 180 | 180 | PNG only |
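
One way the registry above might be expressed in code is a single lookup table shared by the wallpaper, social, and branding generators. The key names and the `getDimensions` helper are hypothetical; the pixel values are copied from the table.

```typescript
// Sketch only: key names are assumptions; dimensions come from the table above.
interface Dimensions {
  width: number;
  height: number;
}

const PLATFORM_DIMENSIONS: Record<string, Dimensions> = {
  "og-image": { width: 1200, height: 630 },
  "linkedin-banner": { width: 1584, height: 396 },
  "twitter-header": { width: 1500, height: 500 },
  "youtube-banner": { width: 2560, height: 1440 },
  "instagram-square": { width: 1080, height: 1080 },
  "desktop-wallpaper": { width: 2560, height: 1440 },
  "mobile-wallpaper": { width: 1080, height: 1920 },
  "favicon": { width: 32, height: 32 },
  "apple-touch-icon": { width: 180, height: 180 },
};

// Fails loudly on unknown asset types instead of rendering at a wrong size.
function getDimensions(assetType: string): Dimensions {
  const dims = PLATFORM_DIMENSIONS[assetType];
  if (dims === undefined) {
    throw new Error(`Unknown asset type: ${assetType}`);
  }
  return dims;
}
```

Keeping the sizes in one table means a platform spec change is a one-line edit rather than a search across generators.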
|
||||
|
||||
---
|
||||
|
||||
## Telegram Bridge Architecture Notes
|
||||
## Generation Job Lifecycle
|
||||
|
||||
**Confidence:** HIGH — Telegraf is the standard Node.js Telegram framework; patterns are well-established.
|
||||
|
||||
### Single Bot, Agent Prefix Pattern
|
||||
All content generation follows this status machine to enable consistent UI feedback:
|
||||
|
||||
```
|
||||
Telegram user sends: "What's the status of the Nexus project?"
|
||||
|
|
||||
Telegraf handler
|
||||
|
|
||||
POST /api/workspaces/:id/chat/messages
|
||||
{ content: "What's the status...", source: "telegram", voice_mode: false }
|
||||
|
|
||||
SSE stream → collect until [DONE]
|
||||
|
|
||||
bot.sendMessage(chatId, "[Hermes] The Nexus project is currently...")
|
||||
queued → generating → ready → (draft → final via placeholder system)
|
||||
↘ error (with structured reason + suggestion)
|
||||
```
|
||||
|
||||
### Voice Note Flow
|
||||
|
||||
```
Telegram user sends voice note (OGG/Opus, ~15s)
    |
Telegraf voice handler: bot.getFile() → download OGG
    |
ffmpeg: OGG → WAV (16kHz mono)
    |
VoicePipelineService.transcribe(wavBuffer)
    |
POST /api/workspaces/:id/chat/messages
  { content: transcript, source: "telegram", voice_mode: true }
    |
Collect SSE stream → spoken variant of response
    |
bot.sendMessage(chatId, "[Hermes] " + spokenResponse)
// v2: bot.sendVoice(chatId, synthesizedOggBuffer)
```
|
||||
|
||||
### Key implementation decisions:
|
||||
|
||||
- **Polling vs. webhooks:** Webhooks require a public HTTPS endpoint. For Mac Mini on home network, long polling is the correct choice. Telegraf supports both; use `bot.launch()` (polling mode) for v1.6.
|
||||
- **Bot token storage:** Environment variable `TELEGRAM_BOT_TOKEN`; added to `.env` and loaded via existing env config pattern.
|
||||
- **Authorized users only:** Store allowed Telegram user IDs or usernames in nexus-settings to prevent unauthorized access; a bridge with no auth is a security hole.
|
||||
- **Conversation context:** Each Telegram chat ID maps to a Nexus workspace session; maintain a `telegramChatId → workspaceId + conversationId` mapping in a lightweight in-memory store or SQLite table.
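
The authorization and conversation-context decisions above could be combined in a small in-memory store, sketched below. The class and method names are hypothetical, and persistence (for example the SQLite option mentioned above) is deliberately left out.

```typescript
// Sketch only: an in-memory variant; names are illustrative.
interface ChatBinding {
  workspaceId: string;
  conversationId: string;
}

class TelegramSessionStore {
  private readonly bindings = new Map<number, ChatBinding>();
  private readonly allowedUserIds: Set<number>;

  constructor(allowedUserIds: number[]) {
    this.allowedUserIds = new Set(allowedUserIds);
  }

  // Reject any Telegram user that is not explicitly allowlisted.
  isAuthorized(userId: number): boolean {
    return this.allowedUserIds.has(userId);
  }

  // Map a Telegram chat to a workspace and conversation.
  bind(chatId: number, binding: ChatBinding): void {
    this.bindings.set(chatId, binding);
  }

  // Returns undefined when the chat has no session yet.
  resolve(chatId: number): ChatBinding | undefined {
    return this.bindings.get(chatId);
  }
}
```

A handler would call `isAuthorized` before any other work, so unauthorized messages never reach the transcription or chat pipeline.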
|
||||
|
||||
---
|
||||
|
||||
## Voice Mode Response Formatting Notes
|
||||
|
||||
**Confidence:** MEDIUM — dual output prompt pattern is used in production systems but prompt reliability varies by model; post-processing strip is more reliable.
|
||||
|
||||
### Two approaches; use B as the fallback for A:
|
||||
|
||||
**Approach A: Prompt-based dual output (preferred)**
|
||||
Append to system prompt when `voice_mode: true`:
|
||||
```
|
||||
When responding, provide two versions:
|
||||
SPOKEN: [1-3 sentences in natural spoken prose, no markdown, no symbols, no lists]
|
||||
DETAILED: [Full response with markdown formatting, code blocks, bullet points as needed]
|
||||
```
|
||||
Parse response: split on `SPOKEN:` and `DETAILED:` markers.
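
A minimal sketch of the marker split described above, assuming the model emits `SPOKEN:` before `DETAILED:`. The `parseDualOutput` name and the `null` return for malformed responses are assumptions; the caller would then fall back to another strategy.

```typescript
// Sketch only: assumes SPOKEN: precedes DETAILED: in the model output.
interface DualOutput {
  spoken: string;
  detailed: string;
}

// Returns null when either marker is missing or a section is empty,
// so the caller can apply a fallback instead of speaking raw markdown.
function parseDualOutput(response: string): DualOutput | null {
  const match = /SPOKEN:\s*([\s\S]*?)\s*DETAILED:\s*([\s\S]*)/.exec(response);
  if (match === null) return null;
  const spoken = (match[1] ?? "").trim();
  const detailed = (match[2] ?? "").trim();
  if (spoken.length === 0 || detailed.length === 0) return null;
  return { spoken, detailed };
}
```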
|
||||
|
||||
**Approach B: Post-processing strip (fallback)**
|
||||
If the model doesn't follow the dual output format, post-process the full response:
|
||||
- Strip `**bold**` → "bold"
|
||||
- Strip `` `code` `` → "code"
|
||||
- Strip `# headers` → remove `#` prefix
|
||||
- Strip `- ` bullet points → convert to sentences or strip
|
||||
- Strip ``` code blocks ``` → summarize as "[code example]" or remove entirely
|
||||
Use as the spoken variant. The full original markdown response is the detailed variant.
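
The strip rules above might look like the following regex pass. It is a rough sketch with a hypothetical `stripMarkdownForSpeech` name: it handles only the constructs listed, replaces fenced blocks with the "[code example]" placeholder, and does not attempt full Markdown parsing.

```typescript
// Sketch only: covers the constructs listed above, not all of Markdown.
function stripMarkdownForSpeech(markdown: string): string {
  return markdown
    .replace(/`{3}[\s\S]*?`{3}/g, "[code example]") // fenced code blocks first
    .replace(/`([^`]+)`/g, "$1")                    // inline code
    .replace(/\*\*([^*]+)\*\*/g, "$1")              // bold
    .replace(/^#{1,6}[ \t]+/gm, "")                 // heading markers
    .replace(/^[ \t]*[-*][ \t]+/gm, "")             // bullet markers
    .replace(/\n{2,}/g, "\n")                       // collapse blank lines
    .trim();
}
```

Fenced blocks are handled before inline code so that the inline rule never sees the triple-backtick delimiters.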
|
||||
|
||||
**Reliable rule:** Never read markdown symbols aloud. Either approach prevents this; dual output is preferred because it lets the LLM choose better phrasing for spoken delivery (short, natural sentences vs. information-dense bullets).
|
||||
- **queued:** Job accepted; worker not yet started
|
||||
- **generating:** Active work; emit SSE progress events with % or step label
|
||||
- **ready:** File available; preview URL returned; download URL available
|
||||
- **draft:** File saved with DRAFT watermark; PLACEHOLDERS.md entry created
|
||||
- **final:** User confirmed; watermark removed; manifest updated
|
||||
- **error:** Structured error with reason + actionable suggestion
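
The status machine above can be enforced with a transition table, sketched here. The diagram does not say which states may fail, so allowing `error` from both `queued` and `generating` is an assumption, as are the names `ALLOWED_TRANSITIONS` and `transition`.

```typescript
// Sketch only: error sources are assumed; the rest follows the diagram.
type JobStatus = "queued" | "generating" | "ready" | "draft" | "final" | "error";

const ALLOWED_TRANSITIONS: Record<JobStatus, JobStatus[]> = {
  queued: ["generating", "error"],
  generating: ["ready", "error"],
  ready: ["draft"],
  draft: ["final"],
  final: [],
  error: [],
};

// Returns the next status, or throws if the move is not in the table.
function transition(current: JobStatus, next: JobStatus): JobStatus {
  if (ALLOWED_TRANSITIONS[current].indexOf(next) === -1) {
    throw new Error(`Invalid job transition: ${current} -> ${next}`);
  }
  return next;
}
```

Rejecting invalid moves in one place keeps every generator's UI feedback consistent, since no pipeline can jump from `queued` straight to `final`.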
|
||||
|
||||
---
|
||||
|
||||
## Sources
|
||||
|
||||
- [Real-Time vs Turn-Based STT/TTS Voice Agent Architecture (softcery.com)](https://softcery.com/lab/ai-voice-agents-real-time-vs-turn-based-tts-stt-architecture)
|
||||
- [The Voice AI Stack for Building Agents (assemblyai.com)](https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents)
|
||||
- [One-Second Voice-to-Voice Latency with Modal, Pipecat, and Open Models (modal.com)](https://modal.com/blog/low-latency-voice-bot)
|
||||
- [Voice Chat with Local LLMs: Whisper + TTS (insiderllm.com)](https://www.insiderllm.com/guides/voice-chat-local-llms-whisper-tts/)
|
||||
- [whisper-cpp VAD (ggml-org/whisper.cpp on GitHub)](https://github.com/ggml-org/whisper.cpp)
|
||||
- [Telegram Bot API — sendVoice (core.telegram.org)](https://core.telegram.org/bots/api)
|
||||
- [Convert Voice Memos from Telegram to Text using OpenAI Whisper (dev.to)](https://dev.to/techresolve/solved-convert-voice-memos-from-telegram-to-text-using-openai-whisper-api-41al)
|
||||
- [Telegram speech-to-text bot with Node.js (loonskai.com)](https://www.loonskai.com/blog/telegram-speech-to-text-bot-with-nodejs)
|
||||
- [Telegraf: Modern Telegram Bot Framework for Node.js (telegraf.js.org)](https://telegraf.js.org/)
|
||||
- [HA Voice PE markdown post-processing discussion (community.home-assistant.io)](https://community.home-assistant.io/t/ha-voice-pe-add-post-processing-step-between-conversation-agent-and-speech-to-text-step/893933)
|
||||
- [Two design patterns for Telegram Bots (dev.to/madhead)](https://dev.to/madhead/two-design-patterns-for-telegram-bots-59f5)
|
||||
- [Design voice AI experiences — LiveKit Agents UI (livekit.com)](https://livekit.com/blog/design-voice-ai-interfaces-with-agents-ui)
|
||||
- [User Interaction Patterns in LLM-Powered Voice Assistants (arxiv.org)](https://arxiv.org/html/2309.13879v2)
|
||||
- [voicegram PyPI — OGG/Opus conversion (pypi.org)](https://pypi.org/project/voicegram/)
|
||||
- [Remotion — Make videos programmatically](https://www.remotion.dev/) — server-side rendering, Remotion Player, `@remotion/renderer` API
|
||||
- [Remotion GitHub](https://github.com/remotion-dev/remotion) — Remotion Skills (January 2026), CPU rendering confirmed
|
||||
- [Mermaid CLI npm — @mermaid-js/mermaid-cli](https://www.npmjs.com/package/@mermaid-js/mermaid-cli) — programmatic `run()` API, Node.js >=18, SVG/PNG/PDF output
|
||||
- [Satori GitHub — vercel/satori](https://github.com/vercel/satori) — HTML/CSS to SVG; flexbox subset; use with Sharp for PNG
|
||||
- [Social media image sizes 2026 — SocialSizes.io](https://socialsizes.io/) — platform dimension registry
|
||||
- [Social media image sizes — Buffer 2026](https://buffer.com/resources/social-media-image-sizes/) — OG 1200×630 confirmed universal
|
||||
- [Accessible Palette — accessiblepalette.com](https://accessiblepalette.com/) — OKLCH/LCh for perceptually uniform palette generation
|
||||
- [InclusiveColors — WCAG accessible palette creator](https://www.inclusivecolors.com/) — WCAG AA enforcement pattern
|
||||
- [Generating accessible color palettes — Canonical](https://canonical.design/blog/generating-color-palettes-for-design-systems-inspired-by-apca/) — APCA/WCAG algorithm approaches
|
||||
- [How to Generate PDFs in 2025 — DEV Community](https://dev.to/michal_szymanowski/how-to-generate-pdfs-in-2025-26gi) — Playwright vs pdf-lib use case guidance
|
||||
- [Puppeteer HTML to PDF — RisingStack](https://blog.risingstack.com/pdf-from-html-node-js-puppeteer/) — HTML-to-PDF pattern; applies to Playwright equivalent
|
||||
- [AI SVG Icon Generator — DEV Community](https://dev.to/albert_nahas_cdc8469a6ae8/i-built-an-ai-powered-svg-icon-generator-with-cli-mcp-server-and-web-app-13eo) — LLM-as-SVG-coder pattern validated
|
||||
- [Tracking designs using watermarks — Atlassian](https://medium.com/designing-atlassian/a-status-in-time-saves-nine-tracking-design-using-watermarks-223e14c59128) — DRAFT watermark status pattern in design workflow
|
||||
- [Branding with AI — BrandForge](https://brandforge.me/blog/branding-with-ai-complete-guide) — brand kit component structure (logo, palette, typography, voice, templates)
|
||||
- [Mermaid Chart export guide](https://mermaid.ai/docs/guides/export-diagram) — SVG, PNG, MMD export options confirmed
|
||||
|
||||
---
|
||||
*Feature research for: Nexus v1.6 Voice Pipeline + Minimal Telegram Bridge*
|
||||
*Researched: 2026-04-03*
|
||||
|
||||
*Feature research for: Nexus v1.7 Content Generation*
|
||||
*Researched: 2026-04-04*
|
||||
|
|
|
|||
|
|
@ -2,14 +2,14 @@
|
|||
|
||||
**Domain:** Forked open-source project with display-layer renames, no i18n layer
|
||||
**Researched:** 2026-04-02 (updated for v1.5 milestone: smart onboarding, multi-provider, voice TTS, persistent memory, assistant mode, `npx buildthis`)
|
||||
**Updated:** 2026-04-03 (v1.6 milestone: server-side Whisper STT, server-side Piper TTS, Telegram bridge)
|
||||
**Updated:** 2026-04-04 (v1.7 milestone: content generation — Remotion, image gen, Mermaid, PDF, theme gen, social media, content skills, large file storage)
|
||||
**Confidence:** HIGH — based on direct codebase analysis of `/opt/nexus/` plus targeted research on each new integration domain
|
||||
|
||||
---
|
||||
|
||||
## About This Document
|
||||
|
||||
This file covers pitfalls for the **v1.5 and v1.6 milestone additions**. The original pitfalls (Pitfalls 1–11) covering fork hygiene, display-layer rename discipline, and upstream sync remain valid and are preserved below. Pitfalls 12–26 are new for v1.5. Pitfalls 27–44 are new for v1.6 (voice pipeline + Telegram bridge).
|
||||
This file covers pitfalls for the **v1.5, v1.6, and v1.7 milestone additions**. The original pitfalls (Pitfalls 1–11) covering fork hygiene, display-layer rename discipline, and upstream sync remain valid and are preserved below. Pitfalls 12–26 are new for v1.5. Pitfalls 27–44 are new for v1.6 (voice pipeline + Telegram bridge). Pitfalls 45–66 are new for v1.7 (content generation layer).
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -1095,6 +1095,618 @@ ffmpeg is not installed by default on macOS. It is available via Homebrew (`brew
|
|||
| ffmpeg missing on production (43) | v1.6 Phase 1 — Whisper STT | Verify: server logs ffmpeg version on startup; `which ffmpeg` on production machine |
|
||||
| Telegram agent prefix in transcription input (44) | v1.6 Phase 3 — Telegram handler | Verify: bot-originated messages are filtered before the transcription pipeline |
|
||||
|
||||
|
||||
---
|
||||
|
||||
## Critical Pitfalls — v1.7 Content Generation Layer
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 45: Calling bundle() Per Render Request
|
||||
|
||||
**What goes wrong:** `@remotion/bundler`'s `bundle()` function runs Webpack to compile the Remotion composition. When called on every video render request, Webpack runs from scratch each time — taking 2–5 minutes before a single frame is encoded. At two concurrent render requests, the server becomes unresponsive. The first symptom is a request queue that grows indefinitely.
|
||||
|
||||
**Why it happens:** Remotion's SSR docs document `bundle()` and `renderMedia()` as a two-step pipeline. Developers naturally call both steps together per request. The anti-pattern is not obvious because the two functions ship in sibling packages (`bundle()` in `@remotion/bundler`, `renderMedia()` in `@remotion/renderer`) and the docs show them sequentially in examples.
|
||||
|
||||
**How to avoid:**
|
||||
1. Call `bundle()` once at server startup (or once when compositions change), cache the bundle path in memory.
|
||||
2. Each render request reuses the cached bundle path and only calls `renderMedia()` with different `inputProps`.
|
||||
3. If compositions change at runtime, invalidate the bundle cache explicitly and re-bundle asynchronously — do not block render requests.
|
||||
4. For the Mac Mini M4 single-user deployment: a startup bundle is fine; no need for elaborate cache invalidation. Re-bundle on process restart.
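
The caching steps above can be sketched as a memoized promise. To keep the example self-contained, the bundler is injected as a plain function; in the real server that function would presumably wrap `bundle()` from `@remotion/bundler`. `createBundleCache` and its methods are hypothetical names.

```typescript
// Sketch only: bundleFn stands in for a wrapper around Remotion's bundle().
type BundleFn = () => Promise<string>;

function createBundleCache(bundleFn: BundleFn) {
  let cached: Promise<string> | null = null;
  return {
    // Every caller shares one in-flight or completed bundle promise.
    get(): Promise<string> {
      if (cached === null) {
        cached = bundleFn().catch((err: unknown) => {
          cached = null; // allow a retry after a failed bundle
          throw err;
        });
      }
      return cached;
    },
    // Call when compositions change; the next get() re-bundles.
    invalidate(): void {
      cached = null;
    },
  };
}
```

A render handler would then await `cache.get()` and pass the resulting path to the render call, so the Webpack cost is paid once per process rather than once per request.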
|
||||
|
||||
**Warning signs:**
|
||||
- `bundle()` call inside the same function/route handler as `renderMedia()`
|
||||
- Render requests taking 3+ minutes for a 30-second video
|
||||
- Server logs showing Webpack compilation on every render
|
||||
- CPU pegged at 100% from the second concurrent render request
|
||||
|
||||
**Phase to address:** Phase 1 (Remotion integration foundation) — bundle caching must be established before any render endpoint is exposed.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 46: Remotion Chromium Concurrency Thrashing on Mac Mini M4
|
||||
|
||||
**What goes wrong:** Remotion spawns one headless Chromium instance per concurrent render frame by default. `concurrency: "100%"` on a 10-core M4 spawns 10 Chrome instances. Each Chromium instance uses ~200–400MB RAM. At 10 instances rendering a complex composition with video assets, the Mac Mini (16GB RAM) hits memory pressure, macOS begins swapping, and render times increase 3–10x. The system may become temporarily unresponsive to UI requests.
|
||||
|
||||
**Why it happens:** Remotion's concurrency model is designed for cloud rendering where the machine has many cores. On a shared personal machine running the full Nexus server stack (Node.js server, Hermes/Ollama, UI), the available RAM for rendering is significantly less than total system RAM.
|
||||
|
||||
**How to avoid:**
|
||||
1. Set `concurrency: 4` as the default for the Mac Mini M4 (leaves ~6 of the 10 cores for other processes).
|
||||
2. Run `npx remotion benchmark` against the specific composition type to find the actual optimal concurrency for the hardware.
|
||||
3. Do not run Remotion renders concurrently with heavy Ollama inference — implement a simple render queue that checks if an Ollama session is active before starting a render.
|
||||
4. In headless mode, Chromium disables GPU acceleration by default (software rasterization). This is slower but more memory-stable than GPU mode for this use case.
|
||||
|
||||
**Warning signs:**
|
||||
- System becoming sluggish during video render
|
||||
- Memory pressure in Activity Monitor during render
|
||||
- Render time increasing non-linearly with video length
|
||||
- `concurrency` not set (defaults to 100% of cores)
|
||||
|
||||
**Phase to address:** Phase 1 (Remotion integration) — concurrency configuration must be set before first production render test.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 47: Bundling Remotion Inside an Already-Bundled Server Context
|
||||
|
||||
**What goes wrong:** The Nexus server is built with `tsc` or `esbuild` into a `dist/` directory and run from there. Remotion's `bundle()` function calls Webpack internally and must be invoked from a non-bundled context with access to the raw source file entry point. When `bundle()` is called from inside the compiled server bundle, it cannot find the Remotion composition source files and throws path resolution errors or silently produces empty bundles.
|
||||
|
||||
**Why it happens:** `bundle()` requires an absolute path to the Remotion entry point (the `.tsx` file). When the server is compiled, `__dirname` and relative paths change. The Remotion entry point lives in the UI package (`ui/src/remotion/`) but the server calls `bundle()` — a cross-package path dependency that breaks after compilation.
|
||||
|
||||
**How to avoid:**
|
||||
1. Keep the Remotion composition source files in a dedicated `packages/remotion-compositions/` package that is never compiled (stays as TypeScript source).
|
||||
2. Pass the absolute path to this package as a config value (`REMOTION_COMPOSITIONS_PATH`) rather than computing it from `__dirname` at runtime.
|
||||
3. In the server, resolve the entry point at startup and log it: `const entryPoint = path.resolve(process.env.REMOTION_COMPOSITIONS_PATH, 'index.ts')`. Fail fast if it does not exist.
|
||||
4. Run `bundle()` in a separate worker process or child process — never inline in the main Express server process.
|
||||
|
||||
**Warning signs:**
|
||||
- `bundle()` working in development (ts-node, pnpm dev) but failing after `pnpm build`
|
||||
- Path resolution errors pointing to `dist/` subdirectories for Remotion entry
|
||||
- Webpack "module not found" errors for composition files during server-side render
|
||||
|
||||
**Phase to address:** Phase 1 (Remotion integration) — entry point resolution strategy must be validated in the compiled server build before any further Remotion work.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 48: 10MB File Size Limit Blocks Video and Large Image Storage
|
||||
|
||||
**What goes wrong:** The existing Nexus/Paperclip storage layer enforces a 10MB maximum file size (`MAX_ATTACHMENT_BYTES = 10 * 1024 * 1024` in `server/src/attachment-types.ts`). A 30-second 1080p video rendered by Remotion is typically 20–200MB. A high-quality wallpaper image at 4K is 5–30MB. Any attempt to store a generated video or large image through the existing attachment/assets upload routes returns HTTP 422 with "File exceeds 10485760 bytes".
|
||||
|
||||
Additionally, `video/mp4` and other video MIME types are not in `DEFAULT_ALLOWED_TYPES`. Both the byte limit and the MIME type allowlist must be extended.
|
||||
|
||||
**Why it happens:** The original limit was set for user-uploaded document attachments (PDFs, images for chat). Generated content is structurally different — it is produced by the system, not uploaded by the user — but routes through the same storage pipeline.
|
||||
|
||||
**How to avoid:**
|
||||
1. Create a separate storage namespace for generated content: `namespace: "generated"` with its own size limits (e.g., 500MB per file, 5GB total per workspace).
|
||||
2. Do not modify `MAX_ATTACHMENT_BYTES` globally — it is the correct limit for user attachments. Add a parallel constant `MAX_GENERATED_ASSET_BYTES`.
|
||||
3. Add video MIME types to the allowed set for the generated assets route only: `video/mp4`, `video/webm`.
|
||||
4. For Remotion output: write directly to the storage provider using `putObject` after render completes, bypassing the upload multipart route entirely. The render runs server-side; no HTTP upload is needed.
|
||||
5. Add a manifest record linking the generated asset to its originating task/issue so the file can be garbage-collected when the task is deleted.
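
A sketch of the parallel limit described above, kept separate from the attachment limit. `MAX_GENERATED_ASSET_BYTES` uses the 500MB example value from step 1. The non-video MIME types in the allowlist and the `validateGeneratedAsset` helper are assumptions for illustration.

```typescript
// Sketch only: the attachment limit is untouched; generated assets get their own.
const MAX_ATTACHMENT_BYTES = 10 * 1024 * 1024;       // existing user-upload limit
const MAX_GENERATED_ASSET_BYTES = 500 * 1024 * 1024; // example value from step 1

// Assumed allowlist for the generated-assets route only.
const GENERATED_ALLOWED_TYPES = new Set<string>([
  "video/mp4",
  "video/webm",
  "image/png",
  "image/svg+xml",
  "application/pdf",
]);

type ValidationResult = { ok: true } | { ok: false; reason: string };

function validateGeneratedAsset(contentType: string, byteLength: number): ValidationResult {
  if (!GENERATED_ALLOWED_TYPES.has(contentType)) {
    return { ok: false, reason: `Content type not allowed: ${contentType}` };
  }
  if (byteLength > MAX_GENERATED_ASSET_BYTES) {
    return { ok: false, reason: `File exceeds ${MAX_GENERATED_ASSET_BYTES} bytes` };
  }
  return { ok: true };
}
```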
|
||||
|
||||
**Warning signs:**
|
||||
- HTTP 422 errors when the server tries to store generated video
|
||||
- `video/mp4` silently rejected by `isAllowedContentType()`
|
||||
- Large generated images silently truncated or rejected
|
||||
- Trying to POST a 50MB video through the existing `/api/companies/:id/assets` upload route
|
||||
|
||||
**Phase to address:** Phase 1 (Storage and file size foundations) — must be resolved before any content type produces files larger than 10MB.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 49: Mermaid securityLevel "loose" Enabling XSS to RCE
|
||||
|
||||
**What goes wrong:** Mermaid diagrams rendered with `securityLevel: "loose"` allow `click` directives that execute arbitrary JavaScript. In an Electron-based or server-rendered context, this becomes remote code execution. In 2025–2026, multiple production apps (OneUptime, DeepChat) were exploited through this vector. The natural language → Mermaid pipeline means AI-generated diagram syntax reaches the renderer — AI models can be prompted to include malicious `click` directives.
|
||||
|
||||
Per-diagram `%%{init: {"securityLevel": "loose"}}%%` directives can override the global setting, so even a "strict" default can be bypassed if the diagram source is not sanitized before passing to `mermaid.render()`.
|
||||
|
||||
**Why it happens:** "loose" mode is documented as enabling "interactive diagrams." Developers enable it to support click events in presentations. The security implication is not obvious from the API surface. AI-generated Mermaid is treated like static diagram syntax rather than untrusted input.
|
||||
|
||||
**How to avoid:**
|
||||
1. Always use `securityLevel: "strict"` globally — no exceptions.
|
||||
2. Before passing any Mermaid source (including AI-generated) to `mermaid.render()`, strip `%%{init}%%` directives and `click` statements using a regex preprocessor.
|
||||
3. After `mermaid.render()` returns SVG, sanitize the SVG output with DOMPurify (using `isomorphic-dompurify` for Node.js server-side rendering) before storing or returning to the client.
|
||||
4. Treat all Mermaid source as untrusted input regardless of origin — even AI-generated diagrams can be manipulated via prompt injection.
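
The regex preprocessor from step 2 might look like the sketch below. The `preprocessMermaidSource` name is hypothetical, and the patterns are deliberately broad: they drop every `%%{...}%%` directive and every line starting with `click`. This is one layer only and assumes the strict security level and SVG sanitization from the other steps remain in place.

```typescript
// Sketch only: a pre-render filter, not a complete defense on its own.
function preprocessMermaidSource(source: string): string {
  return source
    .replace(/%%\{[\s\S]*?\}%%/g, "")                 // drop %%{init: ...}%% directives
    .split("\n")
    .filter((line) => !/^\s*click\s/i.test(line))     // drop click statements
    .join("\n")
    .trim();
}
```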
|
||||
|
||||
**Warning signs:**
|
||||
- `securityLevel: "loose"` anywhere in Mermaid config
|
||||
- Mermaid source passed directly to `mermaid.render()` without preprocessing
|
||||
- No SVG sanitization step after render
|
||||
- `%%{init}%%` directives in AI-generated diagram source not stripped
|
||||
|
||||
**Phase to address:** Phase 3 (Mermaid diagram generation) — security config must be locked before any diagram rendering is exposed to the UI.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 50: DOMPurify Server-Side Memory Accumulation with JSDOM
|
||||
|
||||
**What goes wrong:** Server-side SVG sanitization with DOMPurify requires a DOM environment. The standard approach is `isomorphic-dompurify` backed by JSDOM. In a long-running Node.js process, each `DOMPurify.sanitize()` call accumulates DOM state inside the JSDOM window object. Over hundreds of diagram renders, the JSDOM window grows unboundedly, causing progressive memory increase and eventual OOM.
|
||||
|
||||
Additionally, using `happy-dom` instead of JSDOM as the DOM provider is documented as unsafe and likely to produce XSS bypasses.
|
||||
|
||||
**Why it happens:** JSDOM is designed for single-use in tests, not as a long-running in-process DOM. The memory accumulation is subtle — no immediate crash, just gradual slowdown.
|
||||
|
||||
**How to avoid:**
|
||||
1. Use `isomorphic-dompurify` with JSDOM (not `happy-dom`).
|
||||
2. After every N sanitization calls (e.g., 100), call the window cleanup method to release JSDOM state. Alternatively, create a fresh JSDOM window per sanitization batch.
|
||||
3. For server-side diagram rendering, prefer rendering to SVG in a sandboxed child process (using the existing `plugin-worker-manager.ts` pattern) rather than in the main server process. The child process's memory is fully released on exit.
|
||||
4. Pin JSDOM to version 20+ — version 19 has known attack vectors that allow XSS even with DOMPurify correctly applied.
|
||||
|
||||
**Warning signs:**
|
||||
- Server heap growing steadily during diagram render load testing
|
||||
- Using `happy-dom` as the DOMPurify DOM provider
|
||||
- JSDOM version < 20 in `package.json`
|
||||
- No DOM cleanup between sanitization calls in a long-running process
|
||||
|
||||
**Phase to address:** Phase 3 (Mermaid diagram generation) — establish the server-side sanitization pattern with memory management before rendering is enabled in production.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 51: HSL-Based Color Palette Generation Producing Perceptually Incoherent Themes
|
||||
|
||||
**What goes wrong:** A theme generator takes a brand color and generates a full palette by rotating hue in HSL space (e.g., complementary colors at +180°, triadic at +120°/+240°, tints by varying L). The generated palette looks visually unbalanced: some colors appear much brighter or darker than others even though their HSL lightness values are identical. A blue at L=50% looks significantly darker than a yellow at L=50%.
|
||||
|
||||
WCAG contrast calculations on these palettes pass numerically but the palette feels wrong to human designers, leading to rejection of the feature.
|
||||
|
||||
**Why it happens:** HSL is not perceptually uniform. It is a simple geometric transform of sRGB, so equal numeric steps in HSL lightness do not correspond to equal perceived brightness changes. This is a well-known limitation documented by the CSS working group. Tailwind CSS 4.0 defines its default color palette in OKLCH rather than RGB or HSL.
|
||||
|
||||
**How to avoid:**
|
||||
1. Use OKLCH (OKLab with cylindrical coordinates) for all palette generation operations. OKLCH is available via the `culori` npm library, which has no runtime dependencies; TypeScript typings are installed separately via `@types/culori`.
|
||||
2. Generate tints/shades by varying L in OKLCH space (perceptually uniform lightness), not in HSL.
|
||||
3. Generate complementary/analogous colors by rotating H in OKLCH space.
|
||||
4. Convert to HEX/RGB for output and storage — OKLCH is the computation space, not the output format.
|
||||
5. Do not use HSL as an intermediate — go HEX input → OKLCH computation → HEX output.
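
To make the HEX → OKLCH → HEX flow concrete, here is a dependency-free sketch using the published OKLab matrices. It is illustrative only: in the project `culori` would do this work, the function names are hypothetical, and out-of-gamut results are naively clipped per channel, whereas a production implementation would reduce chroma instead.

```typescript
interface Oklch { l: number; c: number; h: number } // h in degrees

// sRGB transfer functions on normalized 0..1 channel values.
const toLinear = (v: number): number =>
  v <= 0.04045 ? v / 12.92 : Math.pow((v + 0.055) / 1.055, 2.4);
const toGamma = (v: number): number =>
  v <= 0.0031308 ? v * 12.92 : 1.055 * Math.pow(v, 1 / 2.4) - 0.055;

function hexToOklch(hex: string): Oklch {
  const n = parseInt(hex.replace("#", ""), 16);
  const r = toLinear(((n >> 16) & 255) / 255);
  const g = toLinear(((n >> 8) & 255) / 255);
  const b = toLinear((n & 255) / 255);
  // Linear sRGB -> LMS (cube root) -> OKLab.
  const l_ = Math.cbrt(0.4122214708 * r + 0.5363325363 * g + 0.0514459929 * b);
  const m_ = Math.cbrt(0.2119034982 * r + 0.6806995451 * g + 0.1073969566 * b);
  const s_ = Math.cbrt(0.0883024619 * r + 0.2817188376 * g + 0.6299787005 * b);
  const L = 0.2104542553 * l_ + 0.793617785 * m_ - 0.0040720468 * s_;
  const A = 1.9779984951 * l_ - 2.428592205 * m_ + 0.4505937099 * s_;
  const B = 0.0259040371 * l_ + 0.7827717662 * m_ - 0.808675766 * s_;
  const h = (Math.atan2(B, A) * 180) / Math.PI;
  return { l: L, c: Math.hypot(A, B), h: (h + 360) % 360 };
}

function oklchToHex(color: Oklch): string {
  const A = color.c * Math.cos((color.h * Math.PI) / 180);
  const B = color.c * Math.sin((color.h * Math.PI) / 180);
  const l = Math.pow(color.l + 0.3963377774 * A + 0.2158037573 * B, 3);
  const m = Math.pow(color.l - 0.1055613458 * A - 0.0638541728 * B, 3);
  const s = Math.pow(color.l - 0.0894841775 * A - 1.291485548 * B, 3);
  const rgb = [
    4.0767416621 * l - 3.3077115913 * m + 0.2309699292 * s,
    -1.2684380046 * l + 2.6097574011 * m - 0.3413193965 * s,
    -0.0041960863 * l - 0.7034186147 * m + 1.707614701 * s,
  ];
  // Naive per-channel clip; real gamut mapping would lower chroma instead.
  return "#" + rgb
    .map((v) => Math.round(255 * toGamma(Math.min(1, Math.max(0, v)))))
    .map((v) => v.toString(16).padStart(2, "0"))
    .join("");
}

// Complementary or analogous colors: rotate hue, keep lightness and chroma.
const rotateHue = (color: Oklch, degrees: number): Oklch => ({
  l: color.l,
  c: color.c,
  h: (((color.h + degrees) % 360) + 360) % 360,
});
```

Comparing `hexToOklch("#0000ff").l` with `hexToOklch("#ffff00").l` shows the imbalance described above: both colors sit at HSL L=50%, yet their OKLCH lightness values are far apart.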
|
||||
|
||||
**Warning signs:**
|
||||
- Using `hsl()`, `chroma-js` with HSL operations, or manual `(h + 180) % 360` hue rotation
|
||||
- Palette colors appearing visually unbalanced (some look brighter/darker than intended)
|
||||
- Design review rejecting AI-generated palettes as "off"
|
||||
|
||||
**Phase to address:** Phase 4 (Theme and palette generator) — color space selection is a foundation decision; switching after palette logic is built requires rewriting all generation functions.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 52: WCAG Contrast Ratio Computed on sRGB Without Linearization
|
||||
|
||||
**What goes wrong:** WCAG 2.x contrast ratio requires computing relative luminance from sRGB values. The correct computation first normalizes each 8-bit channel to the 0–1 range (`v = byte / 255`) and then linearizes it: values ≤ 0.04045 are divided by 12.92; values > 0.04045 become `((v + 0.055) / 1.055) ^ 2.4`. Developers frequently skip the linearization step, or apply it to the raw 0–255 byte values, producing incorrect contrast ratios. A pair that calculates as "passing WCAG AA (4.5:1)" may actually fail when correctly computed.
|
||||
|
||||
A secondary detail: older editions of the WCAG 2.x specification use `0.03928` as the threshold instead of the sRGB standard's `0.04045`. For 8-bit input the two thresholds give identical results, because no channel value falls between them (10/255 ≈ 0.0392 and 11/255 ≈ 0.0431). The difference only matters for higher-precision input such as floating-point or 10-bit color, so new code should use `0.04045`.
|
||||
|
||||
**Why it happens:** The formula is often copy-pasted from older W3C documentation and tutorials that carry the `0.03928` threshold, and many online "WCAG contrast calculators" still use it. Simplified snippets that drop the normalization or linearization step circulate the same way.
|
||||
|
||||
**How to avoid:**
|
||||
1. Use `culori`'s built-in WCAG functions (`wcagContrast()`, `wcagLuminance()`) which implement the correct linearization.
|
||||
2. If implementing manually, use threshold `0.04045` (not `0.03928`) and ensure linearization happens on normalized 0–1 values (not 0–255 integers).
|
||||
3. Cross-validate computed ratios against [WebAIM Contrast Checker](https://webaim.org/resources/contrastchecker/) for known color pairs during development.
|
||||
4. For the upcoming WCAG 3.0 / APCA standard: note that APCA uses different weights (0.2126729, 0.7151522, 0.0721750) and a polarity-sensitive formula. Use `@colour-contrast/apca` if APCA compliance is needed.
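
For the manual route in step 2, a minimal sketch of the WCAG 2.x calculation looks like the following. The function names are hypothetical and input is assumed to be a six-digit hex string; in the project, `culori`'s `wcagContrast()` would normally be used instead.

```typescript
// Linearize one 8-bit sRGB channel: normalize to 0..1 first, then apply
// the piecewise sRGB transfer function with the 0.04045 threshold.
function linearizeChannel(byte: number): number {
  const v = byte / 255;
  return v <= 0.04045 ? v / 12.92 : Math.pow((v + 0.055) / 1.055, 2.4);
}

// WCAG 2.x relative luminance of a "#rrggbb" color.
function relativeLuminance(hex: string): number {
  const n = parseInt(hex.replace("#", ""), 16);
  const r = linearizeChannel((n >> 16) & 255);
  const g = linearizeChannel((n >> 8) & 255);
  const b = linearizeChannel(n & 255);
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// WCAG 2.x contrast ratio; the lighter color always goes in the numerator.
function contrastRatio(a: string, b: string): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}
```

A useful cross-check pair against a reference calculator: `#767676` on white just passes AA (about 4.54:1), while `#777777` on white just fails (about 4.48:1).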
|
||||
|
||||
**Warning signs:**
|
||||
- Contrast ratio formula not including the linearization conditional branch
|
||||
- Using raw 0–255 integer values in luminance calculation (missing `/255` normalization)
|
||||
- Threshold of `0.03928` in the linearization formula
|
||||
- No cross-validation against known-good reference calculator
|
||||
|
||||
**Phase to address:** Phase 4 (Theme generator) — validated in the WCAG AA export check step before theme output is presented to the user.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 53: PDF Generation Chromium Font Loading Failures in Headless Environments
|
||||
|
||||
**What goes wrong:** PDF generation via Puppeteer/Chromium-headless renders HTML to PDF. The generated PDF uses a specific brand font (e.g., Inter, a custom typeface). In development on the Mac Mini, the font is installed system-wide and loads correctly. In the production server process (started via launchctl), the font is not in the headless Chromium font search path. The PDF renders with a fallback system font, producing different page layouts and line breaks than the designed template — tables overflow, headings reflow, and the PDF looks broken.
|
||||
|
||||
**Why it happens:** Headless Chromium uses its own font resolution paths, not the macOS font manager. User-installed fonts in `~/Library/Fonts` are not accessible to headless Chromium without explicit configuration. The failure is environment-dependent and invisible in development.
|
||||
|
||||
**How to avoid:**
|
||||
1. Bundle all fonts used in PDF templates as static assets in the Nexus codebase (e.g., `packages/pdf-templates/fonts/`). Self-host them via the Express static server.
|
||||
2. Reference fonts in PDF templates using `@font-face` with explicit `src: url('http://localhost:PORT/fonts/...')` — absolute localhost URLs, not relative paths.
|
||||
3. In the Puppeteer page setup, call `page.waitForNetworkIdle()` after navigation to ensure fonts are loaded before calling `page.pdf()`.
|
||||
4. Add a font smoke test: render a one-page PDF at startup and verify the font name embedded in the PDF metadata matches the expected font.
|
||||
|
||||
**Warning signs:**
|
||||
- PDF layout differs between `pnpm dev` and production server
|
||||
- System fonts used instead of brand fonts in generated PDFs
|
||||
- `@font-face` with relative URLs (`./fonts/Inter.woff2`) in PDF templates
|
||||
- No `waitForNetworkIdle()` before `page.pdf()` call
|
||||
|
||||
**Phase to address:** Phase 5 (PDF document generation) — font strategy must be defined before any PDF template is considered complete.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 54: Puppeteer Instance Not Reused Across PDF Render Requests
|
||||
|
||||
**What goes wrong:** Each PDF render request calls `puppeteer.launch()` to create a new browser instance, renders the page, and calls `browser.close()`. Launching a Chromium instance takes 0.5–2 seconds. For a feature that generates PDFs on demand (invoice on task completion, report at end of sprint), this adds significant latency to each render. At 3 concurrent PDF requests, 3 Chromium instances start simultaneously — using ~800MB RAM and 3 full startup sequences.
|
||||
|
||||
**Why it happens:** The code examples in Puppeteer documentation show `launch()` → `newPage()` → `close()` as the simple unit. Reuse is an optimization not shown in introductory examples.
|
||||
|
||||
**How to avoid:**
|
||||
1. Maintain a single persistent Puppeteer browser instance at the server level (similar to the Piper TTS persistent process pattern from v1.6).
|
||||
2. Use `browser.newPage()` per render request and `page.close()` when done — do not close the browser between requests.
|
||||
3. Add a health check: if the browser crashes, restart it automatically (the same backoff pattern used in `plugin-worker-manager.ts`).
|
||||
4. Limit concurrent PDF pages to 2–3 via a semaphore to prevent RAM exhaustion.
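
The lifecycle described in steps 1, 2, and 4 can be sketched as below. `BrowserLike` and `PageLike` are hypothetical structural stand-ins for the Puppeteer objects so that the sketch has no dependencies, `PdfRenderer` is an illustrative name, and restart backoff is omitted: a crashed browser is simply relaunched on the next request.

```typescript
// Hypothetical structural types standing in for Puppeteer's Browser and Page.
interface PageLike { close(): Promise<void> }
interface BrowserLike { connected: boolean; newPage(): Promise<PageLike> }

class Semaphore {
  private available: number;
  private readonly waiters: Array<() => void> = [];
  constructor(permits: number) { this.available = permits; }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.available > 0) this.available -= 1;
    else await new Promise<void>((resolve) => this.waiters.push(resolve));
    try {
      return await task();
    } finally {
      const next = this.waiters.shift();
      if (next) next(); // hand the permit straight to the next waiter
      else this.available += 1;
    }
  }
}

class PdfRenderer {
  private browser: Promise<BrowserLike> | null = null;
  private readonly gate: Semaphore;
  private readonly launch: () => Promise<BrowserLike>;

  constructor(launch: () => Promise<BrowserLike>, maxPages = 2) {
    this.launch = launch;
    this.gate = new Semaphore(maxPages);
  }

  // Launch once, reuse afterwards; relaunch only if the browser has died.
  private async getBrowser(): Promise<BrowserLike> {
    const pending = this.browser;
    if (pending) {
      const browser = await pending;
      if (browser.connected) return browser;
      if (this.browser === pending) this.browser = null;
    }
    const next = this.browser ?? this.launch();
    this.browser = next;
    return next;
  }

  // One page per render, always closed; the browser itself stays alive.
  withPage<T>(work: (page: PageLike) => Promise<T>): Promise<T> {
    return this.gate.run(async () => {
      const page = await (await this.getBrowser()).newPage();
      try {
        return await work(page);
      } finally {
        await page.close();
      }
    });
  }
}
```

A route handler would then call something like `renderer.withPage(page => ...)` and never touch `launch()` directly.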
|
||||
|
||||
**Warning signs:**
|
||||
- `puppeteer.launch()` inside the route handler or per-request function
|
||||
- High memory and CPU spikes on PDF requests visible in Activity Monitor
|
||||
- PDF generation latency >3 seconds for a simple one-page document
|
||||
- No browser lifecycle management (launch once, keep alive)
|
||||
|
||||
**Phase to address:** Phase 5 (PDF generation) — establish browser lifecycle pattern before any PDF template work begins.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 55: Remotion Video File Not Streamable Before Full Render Completes
|
||||
|
||||
**What goes wrong:** Remotion's `renderMedia()` produces a video file only after the entire render is complete. For a 2-minute pitch deck video, this takes 3–10 minutes on the Mac Mini M4. During rendering, the user sees no progress indicator and cannot access even the first few seconds of the video. If the render fails at frame 450 of 3600, all progress is lost with no partial recovery.
|
||||
|
||||
A secondary issue: the rendered video is written to a temp file by default. If the server process crashes or is restarted during a long render, the temp file is orphaned with no manifest record.
|
||||
|
||||
**Why it happens:** Remotion's architecture renders all frames, then encodes. There is no streaming output during rendering. Progress is available via the `onProgress` callback but developers often don't wire it up.
|
||||
|
||||
**How to avoid:**
|
||||
1. Always use the `onProgress` callback in `renderMedia()` to emit render progress to the UI in real time. The existing `live-events-ws.ts` realtime layer can carry these events.
|
||||
2. Write the output to a deterministic path based on a render job ID (not a temp path): `storage/generated/{jobId}/output.mp4`. Create the manifest record before render starts, not after.
|
||||
3. Implement a render job table in the DB (or a simple in-memory map for the single-user case) with states `queued → rendering → done`, plus `failed` as the alternative terminal state reachable from `queued` or `rendering`. Store frame progress in the record.
|
||||
4. For failed renders, keep the manifest record with `status: "failed"` and the error message. Do not silently discard.
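
A minimal sketch of the job record and its lifecycle follows, assuming an in-memory representation. The type and function names are hypothetical; only the state names and the `storage/generated/{jobId}/output.mp4` path come from the list above.

```typescript
type JobStatus = "queued" | "rendering" | "done" | "failed";

interface RenderJob {
  id: string;
  status: JobStatus;
  outputPath: string; // deterministic, derived from the job ID
  renderedFrames: number;
  totalFrames: number;
  error?: string;
}

// Allowed transitions; done and failed are terminal.
const ALLOWED: Record<JobStatus, JobStatus[]> = {
  queued: ["rendering", "failed"],
  rendering: ["done", "failed"],
  done: [],
  failed: [],
};

// Create the record before the render starts, so a crash never orphans output.
function createJob(id: string, totalFrames: number): RenderJob {
  return {
    id,
    status: "queued",
    outputPath: `storage/generated/${id}/output.mp4`,
    renderedFrames: 0,
    totalFrames,
  };
}

function transition(job: RenderJob, next: JobStatus, error?: string): RenderJob {
  if (!ALLOWED[job.status].includes(next)) {
    throw new Error(`illegal transition ${job.status} -> ${next}`);
  }
  return error === undefined ? { ...job, status: next } : { ...job, status: next, error };
}

// Intended to be called from the render progress callback.
function recordProgress(job: RenderJob, renderedFrames: number): RenderJob {
  return { ...job, renderedFrames: Math.min(renderedFrames, job.totalFrames) };
}
```

Keeping failed jobs as records with an `error` field, rather than deleting them, is what makes a failure at frame 450 of 3600 diagnosable afterwards.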
|
||||
|
||||
**Warning signs:**
|
||||
- `renderMedia()` called without `onProgress` callback
|
||||
- Output path using `tmpdir()` or random temp file
|
||||
- No manifest record created before render starts
|
||||
- UI shows no progress during render (user cannot tell if server is working)
|
||||
|
||||
**Phase to address:** Phase 1 (Remotion integration) — progress reporting and job lifecycle management must be designed before any rendering is implemented.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 56: Social Media Image Dimensions and MIME Type Constraints Ignored
|
||||
|
||||
**What goes wrong:** A "social media post" generator produces a 1200×628 OG image and outputs it as PNG. The Instagram API rejects it: Instagram accepts JPEG only, not PNG, for feed posts. Twitter/X accepts up to 5MB for photos but the three-step media upload flow (INIT → APPEND chunks → FINALIZE) is required for anything over ~1MB — a direct upload fails. The 2025 Instagram rate limit reduction from 5,000 to 200 API calls/hour was unannounced and broke production apps; the generator does not account for this and hammers the API during batch generation.
|
||||
|
||||
**Why it happens:** Platform-specific requirements are scattered across documentation pages that are updated without notice. Developers test with a single post and discover constraints only when attempting bulk generation or hitting edge cases in image format.
|
||||
|
||||
**How to avoid:**
|
||||
1. Encode platform constraints as explicit data structures in the skill:
|
||||
```typescript
const PLATFORM_SPECS = {
  instagram: { format: 'jpeg', maxBytes: 8_388_608, dimensions: { feed: [1080, 1080], story: [1080, 1920] } },
  twitter: { format: 'jpeg_or_png', maxBytes: 5_242_880, useChunkedUpload: true },
  linkedin: { format: 'jpeg_or_png', maxBytes: 10_485_760 },
} as const;
```
|
||||
2. Convert all output images to JPEG at the generation step for cross-platform compatibility.
|
||||
3. Implement a rate-limit-aware upload queue with per-platform buckets. For Instagram: max 100 API publishes per 24-hour rolling window, 200 API calls per hour.
|
||||
4. For Twitter/X: always use the chunked upload flow (INIT+APPEND+FINALIZE) regardless of file size — it is more reliable than the simple upload endpoint.
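
The per-platform bucket from step 3 can be sketched as a rolling-window limiter. This is an assumption-laden sketch: `RollingWindowLimiter` is a hypothetical name, the clock is passed in explicitly to keep it testable, and the limits in the usage note are the figures quoted above rather than values verified against current platform documentation.

```typescript
class RollingWindowLimiter {
  private readonly timestamps: number[] = [];
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the call if it fits in the current window.
  tryAcquire(nowMs: number): boolean {
    // Drop calls that have aged out of the rolling window.
    for (;;) {
      const oldest = this.timestamps[0];
      if (oldest === undefined || nowMs - oldest < this.windowMs) break;
      this.timestamps.shift();
    }
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(nowMs);
    return true;
  }
}
```

An upload queue might hold one limiter per constraint, for example `new RollingWindowLimiter(100, 24 * 60 * 60 * 1000)` for Instagram publishes and `new RollingWindowLimiter(200, 60 * 60 * 1000)` for Instagram API calls, and defer any job whose `tryAcquire(Date.now())` returns `false`.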
|
||||
|
||||
**Warning signs:**
|
||||
- Generating PNG images for Instagram posting
|
||||
- Simple single-request media upload (not chunked) for Twitter/X
|
||||
- No rate limit tracking between API calls
|
||||
- Platform spec constants hardcoded as magic numbers scattered through posting code
|
||||
|
||||
**Phase to address:** Phase 6 (Social media content generation) — platform specs table must be defined as the first step, before any image generation or posting code is written.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 57: Content Skills Bypassing Plugin Capability Enforcement
|
||||
|
||||
**What goes wrong:** A content generation skill is implemented as a Paperclip plugin. During development, the plugin worker directly calls internal server routes (e.g., `fetch('http://localhost:PORT/api/companies/...')`) or imports server-side modules (`import { storageService } from '../../server/src/services/storage'`). This works in development but violates the plugin isolation contract: plugins must only communicate with the host via the JSON-RPC bridge defined in the plugin SDK. Direct HTTP calls bypass capability checks and audit logging.
|
||||
|
||||
A related mistake: the plugin stores generated file bytes in `ctx.state` (the plugin key-value state store). `ctx.state` uses the `plugin_state` DB table and is designed for small JSON blobs (configuration, counters, IDs). A 50MB video stored in `ctx.state` as a base64 string will cause severe DB performance degradation and can run into PostgreSQL size limits.
|
||||
|
||||
**Why it happens:** The host-side storage service is accessible from the same process. Developers shortcut the plugin boundary during rapid prototyping. `ctx.state` feels like the obvious place to persist plugin data.
|
||||
|
||||
**How to avoid:**
|
||||
1. Content skills must use `ctx.host.storage.*` RPC methods (when these are added to the plugin SDK for v1.7) to store generated files — never direct HTTP or module imports.
|
||||
2. `ctx.state` is for metadata only: store the asset's `objectKey`, `contentType`, `byteSize`, and `generationParams` as JSON. Never store binary content in state.
|
||||
3. Add a lint rule or TS path restriction that prevents plugin packages (anything built on `@paperclipai/plugin-sdk`) from importing from `../../server/`.
|
||||
4. Review the plugin manifest `capabilities` array before each phase: a content skill generating PDFs needs `plugin.storage.write` but does not need `plugin.agents.read`.
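
Rule 2 can be enforced with a small guard in front of whatever writes to plugin state. The sketch below is hypothetical throughout: `AssetRef`, `toStateValue`, and the 16 KB budget are illustrative choices, not part of the plugin SDK.

```typescript
// Metadata that is safe to keep in plugin state; the bytes live in storage.
interface AssetRef {
  objectKey: string;
  contentType: string;
  byteSize: number;
  generationParams: Record<string, unknown>;
}

// Assumed budget for a single state value; tune to the real store.
const MAX_STATE_CHARS = 16 * 1024;

// Serialize an asset reference for state, refusing anything that looks
// like embedded file content rather than metadata.
function toStateValue(ref: AssetRef): string {
  const json = JSON.stringify(ref);
  if (json.length > MAX_STATE_CHARS) {
    throw new Error(
      `state value is ${json.length} chars; store file bytes in storage and keep only the objectKey`,
    );
  }
  return json;
}
```

The guard is deliberately crude (a character count), but it catches the common failure of a base64 payload slipping into `generationParams`.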
|
||||
|
||||
**Warning signs:**
|
||||
- `fetch()` calls to `http://localhost` inside plugin worker code
|
||||
- `import` statements in plugin code referencing `../../server/` paths
|
||||
- Binary content or large strings stored in `ctx.state`
|
||||
- Plugin manifest with overly broad capabilities (`*` or all capabilities listed)
|
||||
|
||||
**Phase to address:** Phase 2 (Content skills architecture) — plugin boundary rules must be defined before any content skill implementation begins.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 58: Image Generation Model Loaded Per Request Without VRAM Management
|
||||
|
||||
**What goes wrong:** A local image generation endpoint loads the SDXL or Flux model on each request: `model = load_model('flux-dev')`. On the Mac Mini M4 (16–32GB unified memory), loading a 12GB model takes 8–15 seconds and allocates most available memory. If a second image request arrives during model loading, the second load attempt fails or causes memory exhaustion. When the request completes, the model is garbage-collected, only to be reloaded for the next request.
|
||||
|
||||
**Why it happens:** Stateless request handler pattern (load → infer → unload) is the natural first implementation. VRAM/unified memory management is not visible at the application layer.
|
||||
|
||||
**How to avoid:**
|
||||
1. Load the image generation model once at startup (or on first use, then keep in memory). Never reload per request.
|
||||
2. Use a semaphore to ensure only one inference runs at a time on the M4 — Apple Silicon unified memory does not support concurrent model instances efficiently.
|
||||
3. For the M4's unified memory architecture: the model's memory is shared with system RAM. Monitor memory pressure via `os.totalmem()` / `os.freemem()` and emit a warning if free memory falls below 4GB before starting inference.
|
||||
4. If multiple model sizes are available, load the smallest acceptable model by default. Allow the user to select a higher quality model explicitly (with a warning about inference time).
|
||||
5. Implement a simple LRU model cache: if two different models are needed (e.g., icon generation uses a different model than photo generation), keep the most recently used loaded and unload the least recently used when switching.
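
Steps 1, 2, and 5 combine into one small component: a host that keeps a single model resident, serializes inference, and swaps models only when a different one is requested. This is a sketch under stated assumptions: `ModelHost` and its `load`/`unload` callbacks are hypothetical, and the cache holds exactly one model, which is the simplest case of the LRU policy described above.

```typescript
class ModelHost<M> {
  private loaded: { name: string; model: M } | null = null;
  private queue: Promise<unknown> = Promise.resolve();
  private readonly load: (name: string) => Promise<M>;
  private readonly unload: (model: M) => Promise<void>;

  constructor(load: (name: string) => Promise<M>, unload: (model: M) => Promise<void>) {
    this.load = load;
    this.unload = unload;
  }

  // Requests are chained, so at most one load or inference runs at a time.
  infer<R>(name: string, run: (model: M) => Promise<R>): Promise<R> {
    const result = this.queue.then(async () => {
      let current = this.loaded;
      if (current !== null && current.name !== name) {
        // A different model is resident: evict it before loading the new one.
        await this.unload(current.model);
        this.loaded = null;
        current = null;
      }
      if (current === null) {
        current = { name, model: await this.load(name) };
        this.loaded = current;
      }
      return run(current.model);
    });
    // Keep the chain usable even if this request fails.
    this.queue = result.catch(() => undefined);
    return result;
  }
}
```

Because every request goes through the same promise chain, a second request arriving during model load simply waits instead of triggering a second load.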
|
||||
|
||||
**Warning signs:**
|
||||
- Model loading call inside the request handler function
|
||||
- No semaphore or mutex around inference
|
||||
- Memory exhaustion errors on concurrent image generation requests
|
||||
- Model reload happening on every request (check logs for "Loading model..." appearing multiple times)
|
||||
|
||||
**Phase to address:** Phase 7 (Local image generation) — model lifecycle management must be established before any inference endpoint is exposed.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 59: Mermaid Server-Side Rendering Requiring Full DOM in Node.js
|
||||
|
||||
**What goes wrong:** `mermaid.render()` requires a browser DOM environment. In Node.js server-side rendering (for SVG-to-PNG conversion or PDF embedding), calling `mermaid.render()` in the main Node.js process throws "document is not defined". The common workaround — using JSDOM — requires additional setup and has known limitations with Mermaid's SVG rendering (complex diagrams with foreignObject elements may not render correctly).
|
||||
|
||||
An alternative approach (spawning a headless Chromium page via Puppeteer to render Mermaid client-side, then extracting the SVG) adds Chromium as a dependency for what should be a lightweight diagram operation, and reintroduces the Puppeteer lifecycle pitfalls.
|
||||
|
||||
**Why it happens:** Mermaid is designed as a browser library. Its server-side story is underdeveloped — the GitHub issue tracking server-side SVG rendering with JSDOM has been open since 2023 with no complete resolution.
|
||||
|
||||
**How to avoid:**
|
||||
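1. Evaluate `svgdom` (not JSDOM) as the DOM shim for server-side Mermaid rendering. `svgdom` is purpose-built for SVG in Node.js and implements SVG geometry methods such as `getBBox()` that JSDOM lacks, but Mermaid does not officially support it, so validate the output on representative diagrams before committing to it.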
|
||||
2. Alternatively, use the `mmdc` CLI (`@mermaid-js/mermaid-cli`) as a child process: `mmdc -i input.mmd -o output.svg`. This uses Puppeteer internally and encapsulates the DOM requirement, but each CLI invocation launches its own Chromium. To share the PDF generator's browser instead of double-launching, drive the package from Node with the existing Puppeteer browser rather than shelling out.
|
||||
3. For the Nexus use case (agent generates diagram description → Mermaid → embedded in PDF or displayed in UI): render server-side for PDF embedding, render client-side (in the browser) for UI display. These are two separate code paths.
|
||||
4. Cache rendered SVGs by Mermaid source hash — the same diagram definition always produces the same SVG.
|
||||
|
||||
**Warning signs:**
|
||||
- Calling `mermaid.render()` in Node.js without DOM setup
|
||||
- JSDOM used for Mermaid rendering (prone to foreignObject failures)
|
||||
- Separate Chromium launch just for Mermaid (missed opportunity to reuse PDF's browser instance)
|
||||
- No SVG cache — same diagram re-rendered on every page load
|
||||
|
||||
**Phase to address:** Phase 3 (Mermaid diagram generation) — server-side rendering approach must be validated before the diagram feature is integrated with PDF generation.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 60: Agent Heartbeat Timeout Too Short for Long-Running Content Renders
|
||||
|
||||
**What goes wrong:** The Paperclip agent heartbeat model is designed for short execution windows. An agent checks out a task, starts a video render job (3–10 minutes), and the heartbeat timeout fires before the render completes. The heartbeat process exits; the render continues as an orphan child process. The task remains `in_progress` indefinitely. The next heartbeat re-checks the task and either starts a second render (wasting resources and producing duplicates) or reports as blocked.
|
||||
|
||||
**Why it happens:** The heartbeat model assumes agent work completes within a few minutes. Content generation tasks (video rendering, batch image generation, document compilation) violate this assumption. The existing patterns for long-running operations (e.g., git worktree operations) use a different lifecycle model.
|
||||
|
||||
**How to avoid:**
|
||||
1. Content generation tasks must use an async fire-and-forget pattern: the agent heartbeat starts the job, writes the job ID to the task's document, sets status to `in_progress`, and exits. A separate polling routine (using Paperclip's cron/routines feature) checks job status and updates the task to `done` when the render completes.
|
||||
2. Alternatively, use the execution workspace's `workspace-operations.ts` long-running operation pattern for renders — this is already designed for multi-minute operations.
|
||||
3. Never await a render inside a heartbeat handler. Use `renderMedia({ ...options }).then(onComplete).catch(onError)` with the completion callbacks posting a comment to the issue and updating status.
|
||||
4. Add a job ID to the task comment immediately after starting the render: "Render started. Job ID: `{jobId}`. Expected completion: ~5 minutes."
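
The non-blocking start in steps 1 and 3 can be sketched as a small registry. It assumes the registry lives in the long-running server process (not in the short-lived heartbeat process), and `RenderJobs`, `start`, and `poll` are hypothetical names; the `render` callback stands in for the actual `renderMedia(...)` call.

```typescript
type JobState =
  | { status: "rendering" }
  | { status: "done"; outputPath: string }
  | { status: "failed"; error: string };

class RenderJobs {
  private readonly jobs = new Map<string, JobState>();
  private nextId = 1;

  // Starts the render and returns immediately with a job ID.
  start(render: () => Promise<string>): string {
    const jobId = `job-${this.nextId}`;
    this.nextId += 1;
    this.jobs.set(jobId, { status: "rendering" });
    // Deliberately not awaited: the caller must not block on the render.
    render().then(
      (outputPath) => { this.jobs.set(jobId, { status: "done", outputPath }); },
      (err: unknown) => { this.jobs.set(jobId, { status: "failed", error: String(err) }); },
    );
    return jobId;
  }

  // Called by the polling routine to decide whether to close the task.
  poll(jobId: string): JobState | undefined {
    return this.jobs.get(jobId);
  }
}
```

The heartbeat handler would call `start(...)`, post the returned job ID as a task comment, and exit; the polling routine later calls `poll(jobId)` and updates the task once the state is `done` or `failed`.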
|
||||
|
||||
**Warning signs:**
|
||||
- `await renderMedia(...)` inside a heartbeat route handler
|
||||
- Heartbeat timeout shorter than the expected render time
|
||||
- Orphaned render processes after heartbeat exits (check `ps aux | grep remotion`)
|
||||
- Tasks stuck in `in_progress` after render completes
|
||||
|
||||
**Phase to address:** Phase 1 (Remotion integration) — async job model must be designed before the first render is attempted through the agent interface.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 61: Placeholder Assets Without DRAFT Watermark Mistaken for Final Output
|
||||
|
||||
**What goes wrong:** An agent generates a placeholder for a video (a static slide intended as a draft) while the real render is queued. The placeholder is stored and linked to the task. A user reviews the task output, sees a static image, and marks the task as approved, not realizing it is a placeholder pending a full render. The actual video is never triggered because the task is now `done`.
|
||||
|
||||
**Why it happens:** Placeholder assets look similar to real output in the task file list. Without a clear visual indicator and a machine-readable flag, humans and agents alike cannot distinguish "this is final" from "this is a placeholder for X".
|
||||
|
||||
**How to avoid:**
|
||||
1. Store an `isDraft: true` flag in the asset manifest for all placeholder assets. Include this flag in the API response for asset listings.
|
||||
2. Render a visible "DRAFT" overlay directly into placeholder images/videos — not just in the filename. Use `sharp` to composite a semi-transparent "DRAFT" watermark on generated placeholder images.
|
||||
3. In the UI asset list, show a distinct badge (yellow "DRAFT" tag) for assets with `isDraft: true`.
|
||||
4. The agent that queued a render should not mark the parent task as `done` until the render completes and the `isDraft` flag is cleared. Use the job polling routine (Pitfall 60) to trigger the status update.
|
||||
|
||||
**Warning signs:**
|
||||
- Placeholder assets stored without an `isDraft` or `status` field
|
||||
- UI showing placeholder and final assets identically
|
||||
- Tasks marked `done` while the render job is still `queued` or `rendering`
|
||||
- No visual DRAFT indicator in placeholder file content
|
||||
|
||||
**Phase to address:** Phase 2 (Placeholder assets and manifest tracking) — the DRAFT flag and visual indicator must be in place before any placeholder is stored.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 62: Theme Export Using HEX Values That Lose Color Space Information
|
||||
|
||||
**What goes wrong:** A theme generator computes colors in OKLCH (perceptually uniform), validates WCAG contrast ratios, and produces a beautiful palette. It then exports the theme as a set of HEX values. Downstream consumers (CSS custom properties, design system tokens, Tailwind config) receive the HEX values and regenerate their own tints/shades — using HSL, because that is what most tools default to. The palette is immediately corrupted by the round-trip through the wrong color space.
|
||||
|
||||
**Why it happens:** HEX is the universal color exchange format. The perceptual uniformity of OKLCH is lost when values are converted to HEX and then re-processed by tools that use HSL.
|
||||
|
||||
**How to avoid:**
|
||||
1. Export theme tokens in multiple formats simultaneously: HEX (for compatibility), OKLCH (for tools that support it), and CSS custom properties with `oklch()` syntax.
|
||||
2. For Tailwind config export: Tailwind 4 supports OKLCH natively in the config — export `oklch(L% C H)` strings directly.
|
||||
3. For CSS variable exports: `--color-primary: oklch(0.65 0.15 250);` — modern browsers support this.
|
||||
4. Mark the HEX export as "sRGB approximate" in export metadata so consumers know it is lossy.
|
||||
5. Store the OKLCH source values in the theme manifest, not just HEX. The HEX representation is derived output.
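
A multi-format export along these lines might look like the sketch below. The token shape and function names are hypothetical, the HEX value is assumed to have been derived elsewhere, and numbers are trimmed to four decimals purely for readable output.

```typescript
interface ThemeToken {
  name: string;
  l: number; // OKLCH lightness, 0..1
  c: number; // OKLCH chroma
  h: number; // OKLCH hue in degrees
  hex: string; // derived sRGB approximation
}

// Trim float noise without forcing trailing zeros.
const trim = (n: number): string => String(Number(n.toFixed(4)));

const toOklchCss = (t: ThemeToken): string =>
  `oklch(${trim(t.l)} ${trim(t.c)} ${trim(t.h)})`;

function exportTheme(tokens: ThemeToken[]): { css: string; manifest: string } {
  const css = [
    ":root {",
    ...tokens.map((t) => `  --color-${t.name}: ${toOklchCss(t)};`),
    "}",
  ].join("\n");
  // OKLCH is the source of truth; HEX is stored as explicitly lossy output.
  const manifest = JSON.stringify(
    tokens.map((t) => ({
      name: t.name,
      oklch: { l: t.l, c: t.c, h: t.h },
      hex: t.hex,
      hexNote: "sRGB approximate",
    })),
    null,
    2,
  );
  return { css, manifest };
}
```

With a token `{ name: "primary", l: 0.65, c: 0.15, h: 250, hex: "#..." }`, the CSS output contains the same `--color-primary: oklch(0.65 0.15 250);` line used as the example in step 3.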
|
||||
|
||||
**Warning signs:**
|
||||
- Theme manifest storing only HEX values
|
||||
- No OKLCH export format in the theme exporter
|
||||
- Downstream tools re-deriving tints from the exported HEX using HSL
|
||||
- Palette looking "off" after importing into Figma or Tailwind config
|
||||
|
||||
**Phase to address:** Phase 4 (Theme generator) — export format design must include OKLCH from the start. Retrofitting after the exporter is built requires changes to all downstream consumers.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 63: pnpm-lock.yaml Merge Conflicts When Adding Remotion to Monorepo
|
||||
|
||||
**What goes wrong:** Remotion pulls in a large dependency tree: Webpack, Chromium binaries (via `@remotion/renderer`), React (specific version), and multiple `@remotion/*` sub-packages. Adding these to the monorepo's `pnpm-lock.yaml` produces a large lockfile diff. The next upstream rebase (`git rebase upstream/master`) that also touches `pnpm-lock.yaml` produces a conflicted lockfile that cannot be auto-merged. The manual merge is error-prone — resolving lockfile conflicts incorrectly causes `pnpm install` to fail with dependency resolution errors.
|
||||
|
||||
**Why it happens:** The Nexus fork performs periodic rebases onto upstream Paperclip. Both branches add/update dependencies and produce lockfile diffs. Lockfile merge conflicts in pnpm are notoriously difficult because a single dependency change can cascade across hundreds of lockfile lines.
|
||||
|
||||
**How to avoid:**
|
||||
1. Add all Remotion dependencies in a single commit immediately after an upstream rebase (while the lockfile is clean). This minimizes the conflict surface for the next rebase.
|
||||
2. For Remotion's Chromium binary (`@remotion/renderer`): add it as a devDependency of a dedicated `packages/remotion-renderer/` package, isolated from the rest of the monorepo. This limits the lockfile impact to one sub-package.
|
||||
3. On lockfile conflicts: do not attempt to manually merge. Run `pnpm install --no-frozen-lockfile` after resolving `package.json` conflicts — pnpm regenerates the lockfile automatically.
|
||||
4. After each upstream rebase, run `pnpm build` and `pnpm test` to verify the lockfile regeneration did not introduce version regressions.
|
||||
|
||||
**Warning signs:**
|
||||
- Remotion dependencies added to the root `package.json` (adds to every workspace's resolution)
|
||||
- Lockfile conflict during rebase with hundreds of conflicted lines
|
||||
- Attempting to manually edit `pnpm-lock.yaml` to resolve conflicts
|
||||
|
||||
**Phase to address:** Phase 1 (Remotion integration) — dependency isolation strategy must be decided before installing Remotion packages.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 64: SVG Icon Generation Producing Non-Sanitized Output Used in dangerouslySetInnerHTML
|
||||
|
||||
**What goes wrong:** An AI generates SVG markup for an icon (e.g., "generate a minimalist camera icon in SVG"). The generated SVG is stored as a string and rendered in React using `dangerouslySetInnerHTML={{ __html: svgContent }}`. A malicious or hallucinated SVG could contain `<script>` tags, `onclick` attributes, or `<use xlink:href="...">` references to external resources — causing XSS or data exfiltration.
|
||||
|
||||
**Why it happens:** SVG is XML with embedded scripting capability. AI-generated SVG is treated as trusted content because it originated from the system, not from a user. The trust boundary between "system-generated" and "safe" is incorrectly equated.
|
||||
|
||||
**How to avoid:**
|
||||
1. All SVG content — regardless of source — must be sanitized before rendering. Use DOMPurify with SVG-specific config: `FORCE_BODY: true, USE_PROFILES: { svg: true, svgFilters: true }`.
|
||||
2. For icon SVGs specifically: after sanitization, optimize with `svgo` to remove metadata, comments, and non-display elements. Treat this as size and hygiene cleanup only. `svgo` is an optimizer, not a security control, so it must not be relied on to catch anything the sanitizer missed.
|
||||
3. Use `<img src="data:image/svg+xml;base64,...">` for displaying AI-generated icons rather than inline SVG. This prevents script execution entirely — the SVG is rendered as an image, not as DOM.
|
||||
4. Validate that the output is actually an SVG: check for `<svg` root element, valid namespace, and reasonable file size before storing.
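
Steps 3 and 4 can be combined into one helper that validates the structure and returns a value suitable for `<img src>`. This is a sketch with assumptions: the function name and the 64K character cap are hypothetical, the check is structural only (it does not replace sanitization), and it uses a URL-encoded data URI instead of the base64 form mentioned above so that it needs no encoding library.

```typescript
// Assumed upper bound for an icon; tune to the real asset sizes.
const MAX_SVG_CHARS = 64 * 1024;
const SVG_NAMESPACE = "http://www.w3.org/2000/svg";

// Validate that a string looks like a standalone SVG document and return
// a data URI for use in <img src>, where scripts inside the SVG do not run.
function svgToImgSrc(svg: string): string {
  const trimmed = svg.trim();
  if (trimmed.length === 0 || trimmed.length > MAX_SVG_CHARS) {
    throw new Error("SVG is empty or exceeds the size limit");
  }
  // Optional XML declaration, then an <svg> root element.
  if (!/^(<\?xml[^>]*\?>\s*)?<svg[\s>]/i.test(trimmed)) {
    throw new Error("content does not start with an <svg> root element");
  }
  if (!trimmed.includes(SVG_NAMESPACE)) {
    throw new Error("missing SVG namespace declaration");
  }
  return "data:image/svg+xml;charset=utf-8," + encodeURIComponent(trimmed);
}
```

In React this would be rendered as `<img src={svgToImgSrc(svg)} alt="..." />` instead of `dangerouslySetInnerHTML`.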
|
||||
|
||||
**Warning signs:**
|
||||
- `dangerouslySetInnerHTML` used to render AI-generated SVG content
|
||||
- No sanitization step between AI output and SVG storage
|
||||
- SVG stored and served without Content-Security-Policy headers preventing script execution
|
||||
- No file size or structure validation on generated SVG
|
||||
|
||||
**Phase to address:** Phase 8 (Icon generation) — sanitization pipeline must be in place before any generated SVG reaches the DOM.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 65: Branding Media Kit Generation Treating All Assets as a Single Atomic Operation
|
||||
|
||||
**What goes wrong:** A branding media kit requires: logo (SVG), color palette, typography recommendation, banner images (5 sizes), social media templates (6 platforms), PDF one-pager, and icon set (24 icons). Implemented as a single agent task, the generation takes 15–45 minutes. If any single component fails (e.g., the PDF renderer crashes at step 7), the entire kit generation is abandoned with no partial output.
|
||||
|
||||
**Why it happens:** "Generate a brand kit" is naturally conceived as one task. The atomic approach matches how a human designer might present the deliverable — as a complete package. The failure mode only becomes apparent when the first long-running attempt is interrupted.
|
||||
|
||||
**How to avoid:**
|
||||
1. Decompose the brand kit into a parent task with sub-tasks per asset type, using Paperclip's existing `parentId` + `goalId` sub-task pattern.
|
||||
2. Each sub-task (logo generation, palette, PDFs, banners) runs independently and stores its output before the next sub-task begins.
|
||||
3. The parent task aggregates completed sub-task outputs into a final ZIP/manifest. It only moves to `done` when all sub-tasks complete.
|
||||
4. If a sub-task fails, it enters `blocked` state with an error comment — the other sub-tasks continue. The user sees partial progress rather than total failure.
|
||||
5. Use placeholder assets (Pitfall 61) for each sub-task to signal "this component is queued."
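
The aggregation rule in steps 3 and 4 can be sketched as a pure function over sub-task states. The status names `done`, `blocked`, and `in_progress` come from this document; the `todo` status, the `SubTask` shape, and the rule that a parent with only blocked leftovers reports `blocked` are assumptions for illustration, and Paperclip's real issue model may differ.

```typescript
// Hypothetical sub-task model; field names are illustrative only.
type SubTaskStatus = "todo" | "in_progress" | "blocked" | "done";

interface SubTask {
  id: string;
  assetType: string;
  status: SubTaskStatus;
  outputKey?: string; // storage key of the finished asset, if any
}

interface KitSummary {
  parentStatus: "in_progress" | "blocked" | "done";
  completedOutputs: string[]; // partial results the user can already see
  blockedIds: string[];
}

function summarizeKit(subTasks: SubTask[]): KitSummary {
  const completedOutputs: string[] = [];
  const blockedIds: string[] = [];
  let pending = 0;
  let done = 0;

  for (const task of subTasks) {
    if (task.status === "done") {
      done += 1;
      if (task.outputKey !== undefined) completedOutputs.push(task.outputKey);
    } else if (task.status === "blocked") {
      blockedIds.push(task.id);
    } else {
      pending += 1;
    }
  }

  let parentStatus: KitSummary["parentStatus"];
  if (subTasks.length > 0 && done === subTasks.length) {
    parentStatus = "done"; // only when every sub-task has completed
  } else if (pending > 0 || subTasks.length === 0) {
    parentStatus = "in_progress"; // a blocked sibling does not stop the rest
  } else {
    parentStatus = "blocked"; // nothing left to run, at least one failure
  }

  return { parentStatus, completedOutputs, blockedIds };
}
```

Because the summary always carries `completedOutputs`, a failed PDF step still leaves the logo and palette visible instead of discarding the whole kit.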
|
||||
|
||||
**Warning signs:**
|
||||
- All brand kit generation in a single agent run
|
||||
- No sub-task decomposition in the agent's plan
|
||||
- All-or-nothing completion: either full kit or nothing stored
|
||||
- No intermediate progress visible in the UI during kit generation
|
||||
|
||||
**Phase to address:** Phase 9 (Branding media kit) — task decomposition design must be specified before implementation. This is an agent orchestration design decision, not just a code change.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 66: Generated Assets Not Linked to Their Originating Task for Garbage Collection
|
||||
|
||||
**What goes wrong:** Content generation produces files: videos, PDFs, images, SVGs. These files accumulate in the storage directory. When a task is deleted or cancelled, its generated assets remain on disk because no relationship between the task and the generated files was established. Over months, the storage directory fills with orphaned files from cancelled, superseded, or test renders.
|
||||
|
||||
For a single-user deployment on a Mac Mini, disk space is finite. A few hundred video renders can consume 50–200GB of disk without the user being aware.
|
||||
|
||||
**Why it happens:** Generating the file and storing it is the primary flow. Cleanup is deferred as "we'll add it later." The relationship between task and asset is informal (mentioned in task comments) rather than machine-readable.
|
||||
|
||||
**How to avoid:**
|
||||
1. Every generated asset must be stored with a `sourceTaskId` (issue ID) and `sourceRunId` in its manifest record. This is a hard requirement, not an optional field.
|
||||
2. When a task is deleted or moved to `cancelled`, a cleanup job queries all assets with that `sourceTaskId` and queues them for deletion.
|
||||
3. Add a storage usage dashboard visible in the Nexus admin UI: total storage used, per-type breakdown (video, PDF, image), largest files.
|
||||
4. Set retention policies per content type: generated draft videos expire after 7 days unless explicitly pinned; final approved assets are retained indefinitely.
|
||||
5. The existing `storageService` already has `deleteObject` — wire it to the task lifecycle.
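
As a rough illustration of steps 1, 2, and 4, the manifest record and the two cleanup queries might look like the sketch below. The `GeneratedAsset` shape and both function names are assumptions; the sketch only selects object keys and leaves the actual deletion to the existing `storageService.deleteObject`, whose signature is not shown in this document.

```typescript
// Hypothetical manifest record; only sourceTaskId and sourceRunId are named in the text above.
type AssetKind = "video" | "pdf" | "image" | "svg";

interface GeneratedAsset {
  objectKey: string;
  kind: AssetKind;
  sourceTaskId: string; // required, never optional
  sourceRunId: string;
  createdAt: number; // epoch milliseconds
  isDraft: boolean;
  pinned: boolean;
}

// Draft videos expire after 7 days unless pinned (retention policy from step 4).
const DRAFT_VIDEO_RETENTION_MS = 7 * 24 * 60 * 60 * 1000;

// Keys to delete when a task is deleted or cancelled.
function keysForTask(assets: GeneratedAsset[], taskId: string): string[] {
  return assets.filter((a) => a.sourceTaskId === taskId).map((a) => a.objectKey);
}

// Keys to delete on a periodic retention sweep.
function expiredDraftKeys(assets: GeneratedAsset[], now: number): string[] {
  return assets
    .filter(
      (a) =>
        a.kind === "video" &&
        a.isDraft &&
        !a.pinned &&
        now - a.createdAt > DRAFT_VIDEO_RETENTION_MS,
    )
    .map((a) => a.objectKey);
}
```

Keeping selection separate from deletion makes the policy easy to unit-test and lets the cleanup job log what it is about to remove before touching the disk.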
|
||||
|
||||
**Warning signs:**
|
||||
- Assets stored with no `sourceTaskId` field
|
||||
- Storage directory growing unboundedly over weeks
|
||||
- No delete path in the generated asset manifest
|
||||
- Task deletion not triggering asset cleanup
|
||||
|
||||
**Phase to address:** Phase 1 (Storage foundations) — the `sourceTaskId` manifest field must be present from the first generated asset stored, not added retroactively.
|
||||
|
||||
---
|
||||
|
||||
## Technical Debt Patterns — v1.7
|
||||
|
||||
| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|----------|-------------------|----------------|-----------------|
| HEX-only color storage in theme manifest | Simpler, universal format | OKLCH round-trip loss; palette corruption in downstream tools | Never — store OKLCH source values always |
| `bundle()` per render request | No bundle cache management | 5-minute render startup; server unresponsive under load | Never |
| HSL for palette generation | Familiar API surface | Perceptually incoherent palettes; design rejection | Never — use OKLCH via culori |
| Puppeteer `launch()` per PDF request | No browser lifecycle management | 2–3s overhead per PDF; RAM spikes | Never for production; OK for CLI one-shot scripts |
| All brand kit in one agent task | Simple orchestration | All-or-nothing failure; no partial recovery | MVP only if kit has <3 components |
| `ctx.state` for generated file storage | Simplest persistence path | DB row size limits; performance degradation with binary data | Never — use objectKey reference only |
| Global `MAX_ATTACHMENT_BYTES` bump | Quick fix for video storage | User-uploaded attachment limit also raised; security regression | Never — use separate generated assets namespace |
|
||||
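
The "`bundle()` per render request" row can be avoided with a small promise cache created at server startup. The sketch below is illustrative only: `BundleFn` is an injected stand-in for the real bundler call, so nothing here asserts Remotion's actual signature, and `createBundleCache` is a hypothetical helper name.

```typescript
// Stand-in for the real bundler call; injected so the sketch makes no claim
// about the library's actual signature.
type BundleFn = (entryPoint: string) => Promise<string>;

function createBundleCache(bundleFn: BundleFn, entryPoint: string): () => Promise<string> {
  let cached: Promise<string> | undefined;

  return () => {
    if (cached === undefined) {
      // Start bundling once; concurrent callers share the same in-flight promise.
      cached = bundleFn(entryPoint).catch((err) => {
        cached = undefined; // allow a retry after a failed bundle
        throw err;
      });
    }
    return cached;
  };
}
```

Each render request would then await the cached serve URL and go straight to rendering, so the bundling cost is paid once per process rather than once per job.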
|
||||
## Integration Gotchas — v1.7
|
||||
|
||||
| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| Instagram API | PNG images for feed posts | Convert all output to JPEG before posting |
| Instagram API | 5,000 calls/hour assumed (pre-2025 rate) | Use 200 calls/hour budget; implement rate-limit queue |
| Twitter/X media upload | Simple single-request upload | Always use INIT+APPEND+FINALIZE three-step chunked upload |
| Remotion + pnpm | Adding `@remotion/renderer` to root workspace | Isolate in `packages/remotion-renderer/`; avoid lockfile cascade |
| Mermaid server-side | Calling `mermaid.render()` in Node.js without DOM | Use `svgdom` + DOMPurify, or `mmdc` CLI child process |
| Puppeteer fonts | Relying on system fonts in headless Chromium | Self-host all fonts; reference via localhost URL in templates |
| Paperclip plugin SDK | Direct HTTP calls from plugin worker to host | Use `ctx.host.*` RPC bridge only |
| WCAG calculation | WCAG 2.x spec's 0.03928 threshold | Use `culori`'s `wcagContrast()` with correct 0.04045 threshold |
| OKLCH exports | HEX-only export from theme generator | Export HEX + OKLCH + CSS custom properties simultaneously |
|
||||
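
For cross-checking the WCAG row above, the contrast formula can be written out by hand. This is a sketch for validation purposes only; the table's recommendation to use `culori` in production stands, and the function names here are hypothetical. It uses the 0.04045 linearization threshold named in the table.

```typescript
type Rgb = [number, number, number]; // 8-bit sRGB channels, 0..255

// sRGB channel to linear light, using the 0.04045 threshold.
function linearize(channel8bit: number): number {
  const c = channel8bit / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance as a weighted sum of the linearized channels.
function relativeLuminance(color: Rgb): number {
  return (
    0.2126 * linearize(color[0]) +
    0.7152 * linearize(color[1]) +
    0.0722 * linearize(color[2])
  );
}

// Contrast ratio: (lighter + 0.05) / (darker + 0.05), always >= 1.
function contrastRatio(a: Rgb, b: Rgb): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  const lighter = Math.max(la, lb);
  const darker = Math.min(la, lb);
  return (lighter + 0.05) / (darker + 0.05);
}
```

Black on white yields the maximum ratio of 21, and identical colors yield 1, which makes both handy sanity checks when comparing against an external contrast checker.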
|
||||
## Performance Traps — v1.7
|
||||
|
||||
| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| Bundle-per-render | Render queue backed up; server unresponsive | Cache bundle at startup; `renderMedia()` only per request | First concurrent render request |
| Chromium concurrency 100% | Memory pressure; render time 3–10x baseline | Set `concurrency: 4` on M4; benchmark with `npx remotion benchmark` | Second concurrent render on 16GB machine |
| Model-per-request inference | 15s startup on every image generation call | Keep model in memory; semaphore for single-concurrent inference | First concurrent image generation request |
| JSDOM DOMPurify accumulation | Slow diagram renders after 100+ requests | Periodic JSDOM window cleanup; or child process per sanitization batch | After ~200 diagram renders in one process lifetime |
| Puppeteer launch-per-PDF | 2–3s overhead per PDF; RAM spikes | Persistent browser instance; `newPage()` per request | Third concurrent PDF request |
| Unrestricted generated asset storage | Disk full after months of use | Per-type retention policies; `sourceTaskId` for cleanup | After ~100 video renders (50–200GB) |
|
||||
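
The "semaphore for single-concurrent inference" prevention above can be sketched with a small promise-based semaphore. The class below is a generic illustration, not code from the Nexus server; the `Semaphore` name and its `run` method are assumptions, and the same wrapper would apply to any resource that must not run in parallel.

```typescript
// Minimal FIFO semaphore: at most `permits` tasks run at the same time.
class Semaphore {
  private available: number;
  private waiters: Array<() => void> = [];

  constructor(permits: number) {
    this.available = permits;
  }

  // Run `task` once a permit is free; always release, even if the task throws.
  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }

  private acquire(): Promise<void> {
    if (this.available > 0) {
      this.available -= 1;
      return Promise.resolve();
    }
    // No permit free: queue the caller until a running task releases.
    return new Promise<void>((resolve) => {
      this.waiters.push(resolve);
    });
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next !== undefined) {
      next(); // hand the permit directly to the oldest waiter
    } else {
      this.available += 1;
    }
  }
}
```

With `new Semaphore(1)` wrapped around the inference call, a second image request waits in the queue instead of loading a second model copy into memory.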
|
||||
## Security Mistakes — v1.7
|
||||
|
||||
| Mistake | Risk | Prevention |
|---------|------|------------|
| Mermaid `securityLevel: "loose"` | XSS → RCE via AI-generated click directives | Always `"strict"`; strip `%%{init}%%` pre-render; DOMPurify post-render |
| AI-generated SVG via `dangerouslySetInnerHTML` | XSS via script/event injection in SVG | DOMPurify with SVG profile; prefer `<img>` over inline SVG for AI output |
| JSDOM version < 20 with DOMPurify | XSS bypass via known JSDOM 19 attack vectors | Pin JSDOM ≥ 20 |
| Plugin worker direct HTTP to host API | Capability bypass; audit trail gaps | Enforce JSON-RPC bridge only; no `fetch()` to localhost in plugins |
| Generated asset served without content-type validation | Browser interprets SVG as executable HTML | Always set explicit `Content-Type` header from manifest; never infer from file extension |
| Social media API credentials in generated content skill | Token exposure via plugin state leak | Store API credentials in Nexus server config; inject via `ctx.host.secrets.*` RPC |
|
||||
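
The "strip `%%{init}%%` pre-render" step in the first row could look like the sketch below. It is an assumption-laden illustration: `stripMermaidDirectives` is a hypothetical helper, and it removes every `%%{ ... }%%` directive rather than only `init`. It repeats until the text stops changing, because a single pass can be defeated by input that reassembles a directive once an inner match is removed. It complements `securityLevel: "strict"` and DOMPurify and does not replace them.

```typescript
// Matches any Mermaid directive block of the form %%{ ... }%%.
const DIRECTIVE_PATTERN = /%%\{[\s\S]*?\}%%/g;

function stripMermaidDirectives(source: string): string {
  let current = source;
  let previous: string;
  // Loop to a fixed point so removing one match cannot expose a new directive.
  do {
    previous = current;
    current = current.replace(DIRECTIVE_PATTERN, "");
  } while (current !== previous);
  return current.trim();
}
```

Ordinary `%% comment` lines are left alone, since the pattern only matches the `%%{` directive opener.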
|
||||
## "Looks Done But Isn't" Checklist — v1.7
|
||||
|
||||
- [ ] **Remotion render:** `onProgress` callback wired to SSE; render job manifest exists before render starts; output path is deterministic (not temp); job status tracked to completion.
- [ ] **Remotion bundle:** `bundle()` called once at startup, result cached; never called per request; entry point validated at startup.
- [ ] **Mermaid rendering:** `securityLevel: "strict"` set; `%%{init}%%` directives stripped; DOMPurify applied to output SVG; JSDOM ≥ 20.
- [ ] **PDF generation:** fonts self-hosted via localhost URL; Puppeteer browser instance persistent (not per-request); `waitForNetworkIdle()` before `page.pdf()`.
- [ ] **Theme generator:** OKLCH used for all computation; WCAG calculation uses `culori.wcagContrast()`; export includes OKLCH format alongside HEX.
- [ ] **Color palette:** `culori` library used (not HSL manipulation); perceptual uniformity validated by visual inspection; OKLCH L/C/H values stored in manifest.
- [ ] **Storage limits:** generated assets use separate namespace with raised limits; `MAX_ATTACHMENT_BYTES` unchanged; `video/mp4` only allowed on generated assets route; `sourceTaskId` present on all generated assets.
- [ ] **Image generation:** model loaded once (not per request); inference semaphore in place; memory pressure logged before inference start.
- [ ] **Social media:** platform specs table defined as code; JPEG conversion applied before Instagram posts; chunked upload used for Twitter/X; rate limit queue implemented.
- [ ] **Content skills:** all host communication via JSON-RPC bridge (no `fetch()` to localhost); `ctx.state` contains only metadata (objectKey, not binary content); `capabilities` array reviewed and minimal.
- [ ] **SVG icons:** DOMPurify + svgo applied to all AI-generated SVG; rendered as `<img>` not inline DOM where possible.
- [ ] **Brand kit:** decomposed into sub-tasks; each sub-task has its own output manifest; parent task only `done` when all sub-tasks complete.
- [ ] **Asset lifecycle:** all generated assets have `sourceTaskId`; task cancellation triggers asset cleanup query.
- [ ] **Placeholder assets:** `isDraft: true` flag in manifest; visible DRAFT watermark in file content; UI shows DRAFT badge.
|
||||
|
||||
## Pitfall-to-Phase Mapping — v1.7
|
||||
|
||||
| Pitfall | Prevention Phase | Verification |
|---------|------------------|--------------|
| bundle() per render (45) | Phase 1 — Remotion foundation | Server logs show "bundle cached" at startup; no Webpack compilation in render request logs |
| Chromium concurrency thrashing (46) | Phase 1 — Remotion foundation | `concurrency: 4` in render config; Activity Monitor RAM stays below 12GB during render |
| bundle() in compiled server context (47) | Phase 1 — Remotion foundation | `pnpm build && pnpm start` with render request succeeds; no path resolution errors |
| 10MB file size limit (48) | Phase 1 — Storage foundations | Store a 50MB test file via generated assets route; HTTP 200 returned |
| Mermaid XSS via securityLevel loose (49) | Phase 3 — Mermaid generation | `securityLevel: "strict"` in code review; penetration test with `%%{init:{"securityLevel":"loose"}}%%` input |
| DOMPurify JSDOM memory accumulation (50) | Phase 3 — Mermaid generation | Load test: 200 diagram renders; server heap stays flat |
| HSL palette incoherence (51) | Phase 4 — Theme generator | No HSL in palette generation code; visual review of 10 generated palettes |
| WCAG incorrect linearization (52) | Phase 4 — Theme generator | Cross-validate 5 color pairs against WebAIM checker; results match |
| PDF font loading failures (53) | Phase 5 — PDF generation | Generate PDF via launchd service process; font matches dev environment |
| Puppeteer per-request launch (54) | Phase 5 — PDF generation | Browser process count stays at 1 during 10 concurrent PDF requests |
| No render progress reporting (55) | Phase 1 — Remotion foundation | UI shows progress bar during render; SSE events visible in browser DevTools |
| Social media constraints ignored (56) | Phase 6 — Social media skill | Platform specs table exists as typed constant; Instagram posts JPEG; Twitter uses chunked upload |
| Plugin capability bypass (57) | Phase 2 — Content skills architecture | No `fetch('http://localhost')` in any plugin worker file (grep check) |
| Image model per-request (58) | Phase 7 — Image generation | "Loading model" log line appears once at startup; semaphore visible in code |
| Mermaid DOM in Node.js (59) | Phase 3 — Mermaid generation | Server-side render test produces valid SVG; no "document is not defined" error |
| Heartbeat timeout for renders (60) | Phase 1 — Remotion foundation | Agent starts render and exits heartbeat; task still in_progress; completion fires via polling routine |
| Placeholder without DRAFT indicator (61) | Phase 2 — Placeholder assets | Placeholder image contains visible DRAFT watermark; manifest has isDraft:true |
| Theme HEX-only export (62) | Phase 4 — Theme generator | Export JSON contains oklch field; CSS export uses oklch() syntax |
| pnpm lockfile merge conflicts (63) | Phase 1 — Remotion foundation | Remotion in isolated sub-package; post-rebase `pnpm install` succeeds without manual lockfile edit |
| SVG icon XSS (64) | Phase 8 — Icon generation | DOMPurify + svgo applied; icons rendered as `<img>` not inline SVG |
| Brand kit atomic failure (65) | Phase 9 — Branding media kit | Kit generation uses sub-tasks; partial completion visible if one sub-task fails |
| Generated assets without cleanup (66) | Phase 1 — Storage foundations | All stored assets have sourceTaskId; task deletion query confirms cleanup |
|
||||
|
||||
|
||||
---
|
||||
|
||||
## Sources
|
||||
|
|
@ -1106,6 +1718,10 @@ ffmpeg is not installed by default on macOS. It is available via Homebrew (`brew
|
|||
- `/opt/nexus/ui/vite.config.ts` — OnboardingWizard Vite alias pattern
|
||||
- `/opt/nexus/ui/src/components/VoiceRecordButton.tsx` — existing Whisper STT implementation
|
||||
- `/opt/nexus/ui/src/adapters/registry.ts` — adapter registration pattern
|
||||
- `/opt/nexus/server/src/attachment-types.ts` — MAX_ATTACHMENT_BYTES=10MB default; DEFAULT_ALLOWED_TYPES excludes video/* (HIGH confidence — direct read)
|
||||
- `/opt/nexus/server/src/storage/local-disk-provider.ts` — local disk storage: no built-in size limits, atomic write via rename (HIGH confidence — direct read)
|
||||
- `/opt/nexus/server/src/services/plugin-worker-manager.ts` — one worker per plugin, crash recovery backoff, 30s RPC timeout (HIGH confidence — direct read)
|
||||
- `/opt/nexus/server/src/services/plugin-state-store.ts` — plugin state is scoped key-value JSON in DB; not designed for binary blobs (HIGH confidence — direct read)
|
||||
|
||||
**Research (MEDIUM confidence unless noted):**
|
||||
- [Puter.js Free Unlimited AI API](https://developer.puter.com/tutorials/free-unlimited-ai-api/) — Puter is browser-SDK-first; server-side HTTP integration requires manual HTTP calls
|
||||
|
|
@ -1130,6 +1746,26 @@ ffmpeg is not installed by default on macOS. It is available via Homebrew (`brew
|
|||
- [Node.js child_process binary PATH issue](https://github.com/nodejs/help/issues/163) — service environment PATH differs from interactive shell; use absolute paths (HIGH confidence)
|
||||
- [Whisper memory leak](https://github.com/openai/whisper/discussions/605) — RAM not fully released after transcription in some environments (MEDIUM confidence)
|
||||
|
||||
**v1.7 Research (MEDIUM confidence unless noted):**
|
||||
- [Remotion bundle() anti-pattern — official docs](https://www.remotion.dev/docs/bundle) — calling bundle() per render is documented anti-pattern; bundle once, renderMedia() per job (HIGH confidence — official docs)
|
||||
- [Remotion compare-ssr options](https://www.remotion.dev/docs/compare-ssr) — custom server requires managing queuing, progress, error handling; CPU-only on self-hosted (HIGH confidence — official docs)
|
||||
- [Remotion concurrency issue #4300](https://github.com/remotion-dev/remotion/issues/4300) — 100% concurrency with limited Docker CPU causes thrashing; npx remotion benchmark recommended (MEDIUM confidence)
|
||||
- [Remotion Chromium headless memory leak](https://www.remotion.dev/docs/gpu) — angle GL backend memory leak in v2.4.3–2.6.6; use swangle (HIGH confidence — official changelog)
|
||||
- [Mermaid XSS via securityLevel:loose — OneUptime advisory](https://github.com/OneUptime/oneuptime/security/advisories/GHSA-wvh5-6vjm-23qh) — stored XSS via click directive (HIGH confidence — official CVE)
|
||||
- [Mermaid XSS RCE in DeepChat — CVE-2025-67744](https://github.com/ThinkInAIXYZ/deepchat/security/advisories/GHSA-f7q5-vc93-wp6j) — XSS to RCE via Electron IPC (HIGH confidence — official advisory)
|
||||
- [beautiful-mermaid SVG attribute injection — CVE-2026-26226](https://advisories.gitlab.com/pkg/npm/beautiful-mermaid/CVE-2026-26226/) — SVG attribute injection without %%{init}%% (HIGH confidence — GitLab advisory)
|
||||
- [DOMPurify server-side JSDOM requirements](https://github.com/cure53/DOMPurify) — JSDOM ≥ 20 required; happy-dom unsafe; memory accumulation in long-running processes (HIGH confidence — official repo docs)
|
||||
- [Mermaid server-side SVG rendering issue #6634](https://github.com/mermaid-js/mermaid/issues/6634) — JSDOM limitations for foreignObject; svgdom preferred (MEDIUM confidence)
|
||||
- [OKLCH for palette generation — Evil Martians](https://evilmartians.com/chronicles/oklch-in-css-why-quit-rgb-hsl) — HSL perceptual non-uniformity; OKLCH superior for palette generation (HIGH confidence — widely cited)
|
||||
- [Tailwind CSS 4.0 adopts OKLCH](https://blog.simon-hu.org/posts/2025/08---auguest/2025-08-04-css-color-options/) — Tailwind 4 uses OKLCH natively (MEDIUM confidence)
|
||||
- [WCAG contrast linearization formula — W3C](https://www.w3.org/WAI/GL/wiki/Relative_luminance) — 0.04045 is correct threshold; 0.03928 in WCAG 2.x spec is erroneous (HIGH confidence — W3C official wiki)
|
||||
- [Puppeteer PDF font pitfalls — Joyfill](https://joyfill.io/blog/integrating-pdf-generation-into-node-js-backends-tips-gotchas) — system vs headless font paths differ; self-host fonts; lock versions (MEDIUM confidence)
|
||||
- [Never use Puppeteer for PDFs on server — Medium](https://medium.com/@cristian.rosas/never-use-pupeeter-to-create-nodejs-pdfs-on-the-server-recommendation-5cc3a884eba7) — resource-intensive; cold start latency; recommend persistent instance (MEDIUM confidence)
|
||||
- [Social media API rules and rate limits 2026 — Postproxy](https://postproxy.dev/blog/social-media-platform-api-rules-rate-limits-media-specs/) — Instagram 200 calls/hour (down from 5,000); Twitter chunked upload; platform MIME constraints (MEDIUM confidence)
|
||||
- [Instagram PNG rejection / JPEG-only requirement](https://iformat.io/blog/social-media-image-size-guide-2026-all-platforms) — JPEG only for feed posts confirmed (MEDIUM confidence)
|
||||
- [Paperclip plugin SDK capability model — official spec](https://github.com/paperclipai/paperclip/blob/master/doc/plugins/PLUGIN_SPEC.md) — all Worker-to-Host calls gated by manifest.capabilities; plugin bundles must not import from host internals (HIGH confidence — official spec)
|
||||
- [pnpm lockfile merge conflicts — pnpm discussion #4324](https://github.com/orgs/pnpm/discussions/4324) — large dependency additions produce large lockfile diffs; run pnpm install after resolving package.json conflicts (MEDIUM confidence)
|
||||
|
||||
---
|
||||
*Pitfalls research for: Nexus v1.5 — Smart Onboarding + Personal AI Assistant; v1.6 — Voice Pipeline + Telegram Bridge*
|
||||
*Pitfalls research for: Nexus v1.5 — Smart Onboarding + Personal AI Assistant; v1.6 — Voice Pipeline + Telegram Bridge; v1.7 — Content Generation Layer (Remotion, image gen, Mermaid, PDF, theme gen, social media, content skills, large file storage)*
|
||||
*Researched: 2026-04-02; Updated: 2026-04-03*
|
||||
|
|
|
|||
|
|
@ -1,223 +1,281 @@
|
|||
# Technology Stack: v1.6 Voice Pipeline + Telegram Bridge
|
||||
# Technology Stack: v1.7 Content Generation
|
||||
|
||||
**Project:** Nexus v1.6 — additive to v1.5 stack (see prior STACK.md for hardware detection, smart-whisper, Puter.js, vectra, openid-client)
|
||||
**Researched:** 2026-04-03
|
||||
**Scope:** NEW libraries only for v1.6 — server-side voice pipeline integration, audio format conversion, browser VAD, Telegram bridge
|
||||
**Confidence:** MEDIUM-HIGH (grammy HIGH via official docs; vad-react MEDIUM — React 19 peer dep confirmed fixed; ffmpeg-static MEDIUM — archived fluent-ffmpeg confirmed, spawn approach verified)
|
||||
**Project:** Nexus v1.7 — additive to v1.6 stack (see prior STACK.md for voice pipeline, Telegram bridge, ffmpeg-static, grammy, @ricky0123/vad-react)
|
||||
**Researched:** 2026-04-04
|
||||
**Scope:** NEW libraries only for v1.7 — presentations/video, image generation, diagram rendering, PDF generation, SVG icons, color/theme tools, social media assets
|
||||
**Confidence:** MEDIUM-HIGH (Remotion HIGH via official docs; satori+resvg-js HIGH via official repos; mermaid already installed; ComfyUI client MEDIUM; culori MEDIUM via comparison sources)
|
||||
|
||||
---
|
||||
|
||||
## Context: What v1.5 Already Installed
|
||||
## Context: What Is Already Installed
|
||||
|
||||
Do not re-add or re-research these — they are in `server/package.json` or `ui/package.json`:
|
||||
Do not re-add or re-research these — confirmed present in `server/package.json` or `ui/package.json`:
|
||||
|
||||
| Package | Location | Purpose |
|
||||
|---------|----------|---------|
|
||||
| `smart-whisper ^0.8.1` | `server/` | Whisper.cpp Node bindings (recommended in v1.5 STACK.md) |
|
||||
| `@mintplex-labs/piper-tts-web ^1.0.4` | `ui/` | Browser-side Piper WASM (already installed) |
|
||||
| `systeminformation 5` | `server/` | Hardware detection |
|
||||
| `multer ^2.0.2` | `server/` | Multipart upload (already handles audio blob uploads) |
|
||||
| `express ^5.1.0` | `server/` | HTTP server |
|
||||
| Package | Location | Version | Relevant To v1.7 |
|
||||
|---------|----------|---------|-----------------|
|
||||
| `sharp ^0.34.5` | `server/` | 0.34.5 | SVG→PNG conversion, image compositing, social asset output |
|
||||
| `mermaid ^11.12.0` | `ui/` | 11.12.0 | Client-side diagram rendering already works |
|
||||
| `ffmpeg-static ^5.3.0` | `server/` | 5.3.0 | Video stitching, audio for Remotion renders |
|
||||
| `zod ^3.24.2` | `server/` | 3.24.2 | Schema validation for content generation requests |
|
||||
| `express ^5.1.0` | `server/` | 5.1.0 | API endpoints for all content generation routes |
|
||||
|
||||
The existing `VoiceRecordButton` already uses `MediaRecorder` + `POST /api/transcribe`. The existing `usePiperTts` hook already uses `@mintplex-labs/piper-tts-web` for browser-side TTS. The v1.6 work **extends** this — adding silence detection, server-side TTS, and Telegram relay.
|
||||
The v1.7 work **adds** new rendering capabilities — it does not replace or duplicate any of the above.
|
||||
|
||||
---
|
||||
|
||||
## New Libraries by Feature Area
|
||||
|
||||
### 1. Browser VAD (Silence Detection + Auto-Send)
|
||||
### 1. Presentations & Video Generation
|
||||
|
||||
**Package:** `@ricky0123/vad-react`
|
||||
**Version:** `^0.0.36`
|
||||
**Where it lives:** `ui/` only — browser-side ONNX model running off the main thread
|
||||
**Packages:** `remotion`, `@remotion/bundler`, `@remotion/renderer`
|
||||
**Version:** `^4.0.443` (latest — all Remotion packages must share the same version)
|
||||
**Where it lives:** New workspace package `packages/content-renderer/` (isolated — Remotion uses its own webpack pipeline)
|
||||
|
||||
**Why:** The existing `VoiceRecordButton` requires the user to manually tap Stop. `@ricky0123/vad-react` uses Silero VAD (ONNX Runtime Web) to detect when the user stops speaking and fires `onSpeechEnd` automatically with the speech segment as a `Float32Array` at 16kHz. This eliminates the manual stop button and enables waveform-while-speaking UI via the `userSpeaking` state flag.
|
||||
**Why Remotion:**
|
||||
- React-based: slides/presentations are React components; agents generate TSX, and Remotion renders it to MP4 or PNG sequences
|
||||
- SSR API (`@remotion/renderer`) works in Node.js without a browser UI; the Express server calls it directly
|
||||
- Remotion bundles its own Chromium headless shell — no separate Chrome install on Mac Mini
|
||||
- `ffmpeg-static` is already installed for the voice pipeline; Remotion v4 ships its own FFmpeg build, so no `ensureFfmpeg()` call is needed (that helper was removed in v4.0)
|
||||
- Official Express render-server template exists: `remotion-dev/template-render-server` — proven integration pattern
|
||||
- Mac M4/Apple Silicon: Remotion downloads a macOS arm64 Chromium binary; confirmed working
|
||||
|
||||
**React 19 compatibility:** Confirmed fixed in v0.0.36 (August 2025). The peer dependency constraint on React 18 was resolved. No `--legacy-peer-deps` needed.
|
||||
**Why NOT Remotion Lambda/Cloud:** This is a single-user Mac Mini deployment. No serverless. Local rendering only.
|
||||
|
||||
**API surface:**
|
||||
|
||||
```typescript
|
||||
import { useMicVAD } from "@ricky0123/vad-react";
|
||||
|
||||
const vad = useMicVAD({
|
||||
startOnLoad: false, // user must explicitly start
|
||||
positiveSpeechThreshold: 0.3, // sensitivity
|
||||
minSpeechMs: 400, // ignore sub-400ms blips
|
||||
redemptionMs: 1400, // 1.4s silence = end of utterance
|
||||
onSpeechEnd: (audio: Float32Array) => {
|
||||
// audio is 16kHz Float32Array — matches what Whisper expects
|
||||
sendToTranscribeEndpoint(float32ToWav(audio));
|
||||
},
|
||||
});
|
||||
|
||||
// vad.userSpeaking — boolean for waveform animation
|
||||
// vad.listening — boolean for mic state
|
||||
// vad.start() / vad.pause()
|
||||
**Rendering flow:**
|
||||
```
|
||||
Agent generates TSX composition
|
||||
→ POST /api/content/render-video
|
||||
→ @remotion/bundler: bundle(entryPoint) → tmpDir
|
||||
→ @remotion/renderer: renderMedia({ composition, serveUrl, outputLocation })
|
||||
→ returns MP4 path in Nexus file system
|
||||
```
|
||||
|
||||
**Key integration note:** `onSpeechEnd` delivers a `Float32Array` at 16000Hz — this maps directly to what `smart-whisper` expects on the server side, so no resampling is needed in the browser-to-server path.
|
||||
**License note:** Remotion Skills (Claude Code integration) requires a commercial license for companies ≥4 employees. Nexus is single-user personal use — free tier applies.
|
||||
|
||||
**Confidence: MEDIUM** — Version verified via GitHub issues, React 19 fix confirmed. ONNX Runtime Web dependency means an extra ~5MB WASM download on first load.
|
||||
**Confidence: HIGH** — Official SSR docs verified at remotion.dev/docs/ssr. Express template confirmed at github.com/remotion-dev/template-render-server.
|
||||
|
||||
---
|
||||
|
||||
### 2. Audio Format Conversion (Server-Side: WebM → WAV, WAV → OGG)
|
||||
### 2. Diagram Generation (Mermaid → SVG/PNG, Server-Side)
|
||||
|
||||
**Package:** `ffmpeg-static`
|
||||
**Version:** `^5.2.0` (bundles FFmpeg 6.1.1 binaries for macOS arm64 + x64, Linux, Windows)
|
||||
**Where it lives:** `server/` — provides the binary path; invoked via Node.js `child_process.spawn`
|
||||
**Package:** `@mermaid-js/mermaid-cli`
|
||||
**Version:** `^11.12.0`
|
||||
**Where it lives:** `server/` — invoked as a programmatic API, not a CLI subprocess
|
||||
|
||||
**Why `ffmpeg-static` over alternatives:**
|
||||
- `fluent-ffmpeg` was archived on GitHub May 2025, no longer maintained — do NOT use as a new dependency
|
||||
- `@ffmpeg-installer/ffmpeg` — last updated 2022, stale binary (FFmpeg 4.x)
|
||||
- `ffmpeg-static` — actively maintained, ships FFmpeg 6.1.1, macOS arm64 confirmed, installed as an npm dependency (no system-level install needed)
|
||||
- Direct `child_process.spawn("ffmpeg", [...])` with the binary path from `ffmpeg-static` is the recommended approach for 2025+
|
||||
|
||||
**Two conversions needed:**
|
||||
|
||||
**a) Incoming STT path: WebM/Opus → WAV 16kHz mono (for Whisper)**
|
||||
**Why:** `mermaid ^11.12.0` is already installed in `ui/` for client-side rendering. For server-side rendering (agent → stored SVG/PNG file), the CLI package exposes a Node.js API:
|
||||
|
||||
```typescript
|
||||
import ffmpegPath from "ffmpeg-static";
|
||||
import { spawn } from "node:child_process";
|
||||
import { run } from "@mermaid-js/mermaid-cli";
|
||||
|
||||
function webmToWav16k(inputBuffer: Buffer): Promise<Buffer> {
|
||||
return new Promise((resolve, reject) => {
|
||||
const proc = spawn(ffmpegPath!, [
|
||||
"-i", "pipe:0", // read from stdin
|
||||
"-acodec", "pcm_s16le",
|
||||
"-ac", "1", // mono
|
||||
"-ar", "16000", // 16kHz
|
||||
"-f", "wav",
|
||||
"pipe:1", // write to stdout
|
||||
]);
|
||||
const out: Buffer[] = [];
|
||||
proc.stdout.on("data", (c: Buffer) => out.push(c));
|
||||
proc.stdout.on("end", () => resolve(Buffer.concat(out)));
|
||||
proc.stderr.on("data", () => {}); // suppress ffmpeg banner
|
||||
proc.on("error", reject);
|
||||
proc.stdin.write(inputBuffer);
|
||||
proc.stdin.end();
|
||||
});
|
||||
await run(inputMmdFile, outputSvgFile, { outputFormat: "svg" });
|
||||
```
|
||||
|
||||
**Why NOT headless-mermaid or puppeteer-based alternatives:**
|
||||
- `headless-mermaid` is unmaintained (last update 2022)
|
||||
- `@mermaid-js/mermaid-cli` is the official tool, maintained by the mermaid-js org, same version as the `mermaid` npm package
|
||||
- It uses puppeteer internally but packages everything — no additional Chrome install needed
|
||||
|
||||
**Important:** `@mermaid-js/mermaid-cli` installs its own puppeteer with bundled Chromium. On Mac M4, this is a separate ~300MB download from Remotion's Chromium. Consider: if Remotion is already installed and its Chromium binary is available, pipe mermaid rendering through `@remotion/renderer` instead to share the Chromium binary. This is an implementation optimization, not a blocker.
|
||||
|
||||
**Confidence: MEDIUM** — Official npm package, same version as mermaid. Programmatic API (`run()`) confirmed in README.
|
||||
|
||||
---
|
||||
|
||||
### 3. PDF Generation (Reports, Invoices)
|
||||
|
||||
**Package:** `playwright-chromium`
|
||||
**Version:** `^1.50.0` (do NOT use `^1.59.1` from root devDependencies — that's for e2e tests; install a separate `playwright-chromium` in `server/`)
|
||||
**Where it lives:** `server/`
|
||||
|
||||
**Why Playwright over Puppeteer:**
|
||||
- 2026 benchmark (macOS arm64, Node 22): Playwright is 42ms cold vs Puppeteer's 147ms cold, 3ms warm vs 48ms warm
|
||||
- `playwright-chromium` installs only the Chromium binary — no Firefox or WebKit overhead
|
||||
- TypeScript-native, no `@types/playwright` needed
|
||||
- Actively maintained by Microsoft; Puppeteer-core is maintained by Google but Playwright has overtaken it for new projects as of 2025-2026
|
||||
|
||||
**Why NOT puppeteer-core:**
|
||||
- Slower at every data point in 2026 benchmarks
|
||||
- `playwright-chromium` provides an equally thin package (Chromium only) at better performance
|
||||
|
||||
**Why NOT @remotion/renderer for PDFs:**
|
||||
- Remotion is optimized for video frame rendering, not document layout
|
||||
- Playwright renders HTML/CSS directly via Chrome's print-to-PDF — correct page breaks, headers, footers, print stylesheet support
|
||||
|
||||
**PDF generation pattern:**
|
||||
```typescript
|
||||
import { chromium } from "playwright-chromium";
|
||||
|
||||
const browser = await chromium.launch({ headless: true });
|
||||
const page = await browser.newPage();
|
||||
await page.setContent(htmlString, { waitUntil: "networkidle" });
|
||||
const pdf = await page.pdf({ format: "A4", printBackground: true });
|
||||
await browser.close();
|
||||
return pdf; // Buffer
|
||||
```
|
||||
|
||||
**Mac M4 note:** `playwright-chromium` downloads a macOS arm64 Chromium binary via `npx playwright install chromium`. This is yet another Chromium binary. Consider sharing with Remotion's binary — implementation detail, investigate during Phase execution.
|
||||
|
||||
**Confidence: HIGH** for the PDF API (Playwright's `page.pdf()` is the counterpart of the Puppeteer API documented at pptr.dev); MEDIUM for the performance numbers, which come from a single benchmark at pdf4.dev/blog/html-to-pdf-benchmark-2026 (March 2026, macOS arm64 run).
|
||||
|
||||
---
|
||||
|
||||
### 4. Social Media Images & OG Images (Satori + resvg-js)
|
||||
|
||||
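**Packages:** `satori`, `@resvg/resvg-js` (the Rust napi-rs build is published under the `@resvg` scope; referred to as `resvg-js` below)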
|
||||
**Versions:** `satori ^0.26.0`, `@resvg/resvg-js ^2.6.2`
|
||||
**Where it lives:** `server/`
|
||||
|
||||
**Why satori:**
|
||||
- Converts React JSX (HTML/CSS subset) to SVG — no browser needed, pure Node.js
|
||||
- Used by `@vercel/og` internally — battle-tested for exactly this use case (OG images, social cards)
|
||||
- Agents generate JSX layouts describing the social post design; satori renders to SVG
|
||||
- Supports TTF/OTF/WOFF fonts — can embed the Nexus brand font
|
||||
|
||||
**Why resvg-js (paired with satori):**
|
||||
- Converts the SVG output from satori → PNG via Rust bindings (napi-rs)
|
||||
- No headless browser needed — pure native Node.js module
|
||||
- `sharp ^0.34.5` is already installed and can do SVG→PNG too, but resvg-js handles text rendering more accurately when the SVG uses custom fonts embedded by satori
|
||||
|
||||
**When to use sharp vs resvg-js:**
|
||||
- `sharp`: image compositing, resizing, format conversion (WebP, AVIF), photo manipulation
|
||||
- `resvg-js`: satori SVG → PNG with correct font rendering
|
||||
|
||||
**Social media output flow:**
|
||||
```
Agent generates JSX layout + text
  → await satori(jsx, { width: 1200, height: 630, fonts: [...] })   → SVG string
  → new Resvg(svgString).render().asPng()                           → PNG Buffer
  → await sharp(pngBuffer).resize(width, height).webp().toBuffer()  → final asset
```
|
||||
|
||||
**Platform dimensions (built-in config):**
|
||||
| Platform | Size |
|
||||
|----------|------|
|
||||
| OpenGraph | 1200×630 |
|
||||
| Twitter/X card | 1200×628 |
|
||||
| Instagram square | 1080×1080 |
|
||||
| Instagram story | 1080×1920 |
|
||||
| LinkedIn post | 1200×627 |
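
The dimension table above could be captured as a typed lookup so that render code never hard-codes sizes. This is a minimal sketch; `PLATFORM_DIMENSIONS`, `Platform`, and `dimensionsFor` are illustrative names, not identifiers from the Nexus codebase.

```typescript
// Hypothetical config mirroring the platform table; all names are illustrative.
type Platform =
  | "opengraph"
  | "twitterCard"
  | "instagramSquare"
  | "instagramStory"
  | "linkedinPost";

interface Dimensions {
  width: number;
  height: number;
}

const PLATFORM_DIMENSIONS: Record<Platform, Dimensions> = {
  opengraph: { width: 1200, height: 630 },
  twitterCard: { width: 1200, height: 628 },
  instagramSquare: { width: 1080, height: 1080 },
  instagramStory: { width: 1080, height: 1920 },
  linkedinPost: { width: 1200, height: 627 },
};

// Returns a copy so callers cannot mutate the shared config object.
function dimensionsFor(platform: Platform): Dimensions {
  return { ...PLATFORM_DIMENSIONS[platform] };
}
```

The returned `width` and `height` would feed both the satori call and the final `sharp` resize step.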
|
||||
|
||||
**Confidence: HIGH** — satori GitHub (vercel/satori) reviewed, resvg-js GitHub (thx/resvg-js) reviewed. Pattern confirmed in multiple tutorials. Versions confirmed via npm registry.
|
||||
|
||||
---
|
||||
|
||||
### 5. SVG Icon Generation
|
||||
|
||||
**No new library needed.** The existing stack already covers this:
|
||||
|
||||
- `sharp ^0.34.5`: can rasterize SVG to PNG at any resolution
|
||||
- `resvg-js ^2.6.2` (added above): accurate SVG rendering with custom fonts
|
||||
- Template-based SVG generation: agents produce SVG strings using a template library or string composition
|
||||
|
||||
**Approach:** Agents generate SVG markup directly (simple geometric shapes, paths, text). The server validates the SVG, stores it in the file system, and optionally exports PNG variants via `sharp`. No additional icon library needed.
|
||||
|
||||
**Why NOT `svg.js` or `svg-builder`:** These are authoring libraries for complex interactive SVG manipulation in the browser. For agent-generated icons, the agent produces the SVG string — the server only needs to validate and render it.
|
||||
|
||||
**Confidence: HIGH** — this is an implementation decision, not a library gap.
|
||||
|
||||
---
|
||||
|
||||
### 6. Theme & Palette Generator with WCAG Contrast
|
||||
|
||||
**Package:** `culori`
|
||||
**Version:** `^4.0.2`
|
||||
**Where it lives:** `server/` and optionally `ui/` (tree-shakeable ESM)
|
||||
|
||||
**Why culori over chroma-js:**
|
||||
- 2026 community consensus: "OKLCH is the future of CSS color — use culori for any design-system work"
|
||||
- culori supports the OKLCH color space natively; chroma-js 3.x has limited OKLCH support
|
||||
- The pkgpulse.com comparison (2026) rates culori's WCAG contrast calculation as the more accurate of the two, although both libraries expose a WCAG contrast ratio function
- culori is fully ESM + CJS with tree-shaking; chroma-js 3.x is CJS-first with ESM wrapper
|
||||
|
||||
**Why NOT tinycolor2:** Minimal WCAG support, no OKLCH, not maintained for design-system work.
|
||||
|
||||
**WCAG AA validation pattern:**
|
||||
```typescript
import { wcagContrast, oklch, formatHex } from "culori";

function meetsWcagAA(fg: string, bg: string): boolean {
  return wcagContrast(fg, bg) >= 4.5; // AA for normal text
}

function generateAccessiblePalette(baseHex: string) {
  const base = oklch(baseHex);
  // culori converters return undefined for input they cannot parse
  if (!base) throw new Error(`Unparseable color: ${baseHex}`);
  // Generate shades by adjusting lightness in OKLCH space
  return [0.95, 0.85, 0.70, 0.55, 0.40, 0.25, 0.15].map((l) => {
    const hex = formatHex({ ...base, l });
    return { hex, contrast: wcagContrast(hex, "#ffffff") };
  });
}
```
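
For reference, the WCAG 2.x contrast ratio is defined as (L1 + 0.05) / (L2 + 0.05), where L1 and L2 are the relative luminances of the lighter and darker color. The dependency-free sketch below computes it directly and may be useful for sanity-checking palette output in tests. The helper names are illustrative, and only 6-digit hex input is handled.

```typescript
// Dependency-free sketch of the WCAG 2.x contrast ratio (illustrative helper names).

// Parse "#rrggbb" into channel values in [0, 1].
function hexToRgb(hex: string): [number, number, number] {
  const m = /^#?([0-9a-f]{6})$/i.exec(hex.trim());
  if (!m) throw new Error(`Expected 6-digit hex, got: ${hex}`);
  const n = parseInt(m[1], 16);
  return [((n >> 16) & 0xff) / 255, ((n >> 8) & 0xff) / 255, (n & 0xff) / 255];
}

// Undo the sRGB transfer curve, per the WCAG relative-luminance definition.
function linearize(c: number): number {
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

function relativeLuminance(hex: string): number {
  const [r, g, b] = hexToRgb(hex);
  return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b);
}

// Ranges from 1 (identical colors) to 21 (black against white).
function contrastRatio(a: string, b: string): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}
```

A ratio of at least 4.5 corresponds to the AA threshold for normal text used in `meetsWcagAA`.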
|
||||
|
||||
**b) Outgoing Telegram TTS path: WAV/PCM → OGG Opus (Telegram voice format)**
|
||||
**Confidence: MEDIUM** — culori version verified via npm (4.0.2). WCAG accuracy claim sourced from pkgpulse.com comparison article (2026). Official culori docs reviewed.
|
||||
|
||||
---
|
||||
|
||||
### 7. Local Image Generation (Stable Diffusion / Flux)
|
||||
|
||||
**Package:** `@stable-canvas/comfyui-client`
|
||||
**Version:** `^1.5.9`
|
||||
**Where it lives:** `server/` — optional, only used when ComfyUI is detected running locally
|
||||
|
||||
**Why ComfyUI + this client:**
|
||||
- ComfyUI is the dominant local Stable Diffusion frontend in 2026 (alongside AUTOMATIC1111/Forge)
|
||||
- It exposes a REST + WebSocket API that this client wraps with TypeScript types
|
||||
- On Mac M4 (Apple Silicon), ComfyUI runs via Metal/MPS backend — confirmed working with Flux.1 models
|
||||
- The client is zero-dependency, MIT licensed, supports both REST (sync) and WebSocket (streaming progress) APIs
|
||||
|
||||
**Why NOT direct Stable Diffusion Python bindings:**
|
||||
- Mac M4 has no CUDA; Python SD libraries that require CUDA don't work
|
||||
- ComfyUI abstracts the backend (Metal/MPS on Mac) behind a standard REST API
|
||||
- Nexus agents call the REST API — they don't care about the backend hardware
|
||||
|
||||
**Architecture:** ComfyUI runs as a separate process (not managed by Nexus). Nexus detects it via health check on `http://localhost:8188` at startup. If not running, image generation features degrade gracefully with a "ComfyUI not available" message.
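
The startup probe described above might look like the following sketch. It assumes ComfyUI answers `GET /system_stats` on its HTTP port; `isComfyUiAvailable` and the injectable `fetchImpl` parameter are hypothetical and exist only to make the probe easy to test.

```typescript
// Sketch of the startup health check. Assumptions: ComfyUI responds to
// GET /system_stats, and the helper name is hypothetical.
type FetchLike = (
  url: string,
  init?: { signal?: AbortSignal },
) => Promise<{ ok: boolean }>;

async function isComfyUiAvailable(
  baseUrl = "http://localhost:8188",
  timeoutMs = 1500,
  fetchImpl: FetchLike = fetch,
): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(`${baseUrl}/system_stats`, {
      signal: controller.signal,
    });
    return res.ok;
  } catch {
    // Connection refused or timeout: report "not available" instead of throwing.
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```

Because the function never throws, the server can call it once at boot and register the image-generation routes in a degraded mode when it returns `false`.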
|
||||
|
||||
**Integration pattern:**
|
||||
```typescript
|
||||
function wavToOggOpus(inputBuffer: Buffer): Promise<Buffer> {
|
||||
return new Promise((resolve, reject) => {
|
||||
const proc = spawn(ffmpegPath!, [
|
||||
"-i", "pipe:0",
|
||||
"-c:a", "libopus",
|
||||
"-b:a", "32k",
|
||||
"-f", "ogg",
|
||||
"pipe:1",
|
||||
]);
|
||||
// ... same pattern as above
|
||||
});
|
||||
}
|
||||
```
|
||||
import { Client } from "@stable-canvas/comfyui-client";
|
||||
|
||||
**Confidence: MEDIUM** — `ffmpeg-static` macOS arm64 confirmed via GitHub README. Pipe-based approach is well-documented. fluent-ffmpeg archival confirmed May 2025.
|
||||
const client = new Client({ api_host: "localhost:8188" });
|
||||
await client.init(); // WebSocket handshake
|
||||
|
||||
---
|
||||
|
||||
### 3. Telegram Bridge
|
||||
|
||||
**Package:** `grammy`
|
||||
**Version:** `^1.41.1` (latest, supports Bot API 9.6)
|
||||
**Where it lives:** `server/` as an optional singleton service — only starts if `TELEGRAM_BOT_TOKEN` is set
|
||||
|
||||
**Why grammy over alternatives:**
|
||||
- `grammy` has 1.4M weekly downloads vs `telegraf` at 900K — grammY is now the higher-adoption choice
|
||||
- grammY is TypeScript-first (clean types, no DefinitelyTyped). Telegraf v4 migrated to TS but the type system is described as "too complex to understand" in grammY's own comparison docs
|
||||
- `node-telegram-bot-api` is lower-level with no middleware, requires more boilerplate for this use case
|
||||
- grammY's file handling API (`ctx.getFile()`) is the cleanest for the voice relay use case
|
||||
|
||||
**What the bridge needs to do (thin relay only — per PROJECT.md):**
|
||||
|
||||
```typescript
|
||||
import { Bot, Context } from "grammy";
|
||||
|
||||
const bot = new Bot(process.env.TELEGRAM_BOT_TOKEN!);
|
||||
|
||||
// Relay text messages to Nexus chat API
|
||||
bot.on("message:text", async (ctx) => {
|
||||
const response = await relayToNexus(ctx.message.text, ctx.from.id);
|
||||
await ctx.reply(response);
|
||||
const result = await client.enqueue(workflowJson, {
|
||||
progress: (p) => sendSSEProgress(p),
|
||||
});
|
||||
|
||||
// Receive voice messages — download OGG, transcribe, relay
|
||||
bot.on("message:voice", async (ctx) => {
|
||||
const file = await ctx.getFile();
|
||||
// downloadFile() is a local helper: fetch https://api.telegram.org/file/bot<token>/<file_path> into a Buffer (link valid for at least 1 hour)
|
||||
const oggBuffer = await downloadFile(file.file_path!, bot.token);
|
||||
const transcript = await transcribeOgg(oggBuffer); // via smart-whisper
|
||||
const response = await relayToNexus(transcript, ctx.from.id);
|
||||
await ctx.reply(response);
|
||||
});
|
||||
|
||||
// Run with long polling (no webhook needed for single-user local setup)
|
||||
bot.start();
|
||||
// result.images[0] is a Buffer
|
||||
```
|
||||
|
||||
**Voice message format from Telegram:** Telegram sends voice messages as OGG/Opus, 32kbps, mono, 48kHz. To pass this to Whisper (which needs 16kHz WAV), convert with `ffmpeg-static` pipeline: `ogg→wav16k`.
|
||||
|
||||
**To send TTS back to Telegram:** Convert Piper WAV output → OGG Opus via `ffmpeg-static`, then use `ctx.replyWithVoice(new InputFile(oggBuffer, "voice.ogg"))`.
|
||||
|
||||
**Long polling vs webhook:** Long polling is correct for this deployment (Mac Mini, local network, no public HTTPS endpoint required). No reverse proxy or SSL cert needed.
|
||||
|
||||
**Confidence: HIGH** — grammy official docs verified at grammy.dev. File download pattern confirmed via grammY file handling guide. Bot API 9.6 support confirmed in homepage badge.
|
||||
|
||||
---
|
||||
|
||||
### 4. Server-Side Piper TTS (Audio Response Endpoint)
|
||||
|
||||
**No new library needed.** The v1.5 STACK.md already specified the `child_process.spawn` approach with the Piper binary.
|
||||
|
||||
**What v1.6 adds on top of v1.5:**
|
||||
- A new Express route: `POST /api/voice/synthesize` that accepts `{ text, voice? }` and returns raw WAV audio (`Content-Type: audio/wav`)
|
||||
- This endpoint is used by both the web chat playback (browser `<audio>` element) and the Telegram bridge (convert WAV → OGG for `sendVoice`)
|
||||
- Voice mode flag: requests with `voiceMode: true` should receive a condensed plain-language response (no markdown, no code blocks) — this is a prompt instruction layer, not a library
|
||||
|
||||
**Response shape:**
|
||||
|
||||
```
|
||||
POST /api/voice/synthesize
|
||||
Body: { text: string, voice?: "en_US-lessac-medium" }
|
||||
Response: audio/wav binary stream
|
||||
```
|
||||
|
||||
**Confidence: HIGH** — this is an implementation pattern, not a new library.
|
||||
|
||||
---
|
||||
|
||||
### 5. Audio Playback (Web Chat)
|
||||
|
||||
**No new library needed.** The browser's native `<audio>` element handles WAV and OGG playback. The existing `TtsButton` uses `new Audio(url)` already. The v1.6 enhancement is:
|
||||
|
||||
- Upgrade from `new Audio(url)` to a proper inline `<audio controls>` player with auto-play toggle stored in settings
|
||||
- Use `URL.createObjectURL(blob)` to play TTS responses once the full audio blob has been received
|
||||
- Waveform visualization via `AnalyserNode` from the Web Audio API — no library needed
|
||||
|
||||
**Confidence: HIGH** — Web Audio API and `<audio>` are native browser APIs. No library required.
|
||||
**Confidence: MEDIUM** — Client package confirmed on npm (1.5.9, MIT, zero deps). ComfyUI Mac M4 support sourced from offlinecreator.com (March 2026). WebSocket API pattern confirmed via GitHub README.
|
||||
|
||||
---
|
||||
|
||||
## Installation Summary
|
||||
|
||||
```bash
|
||||
# ui/ — add VAD for silence detection + auto-send
|
||||
pnpm --filter @paperclipai/ui add @ricky0123/vad-react
|
||||
# packages/content-renderer/ — new workspace package for Remotion
|
||||
# (Remotion needs its own webpack pipeline; isolate from main server build)
|
||||
pnpm --filter @paperclipai/content-renderer add remotion @remotion/bundler @remotion/renderer
|
||||
|
||||
# server/ — add FFmpeg binary (for audio format conversion) and Telegram bot
|
||||
pnpm --filter @paperclipai/server add ffmpeg-static grammy
|
||||
# server/ — diagram rendering (server-side mermaid)
|
||||
pnpm --filter @paperclipai/server add @mermaid-js/mermaid-cli
|
||||
|
||||
# server/ — types (ffmpeg-static ships its own types; grammy is TS-native)
|
||||
# No @types/* needed for grammy
|
||||
# ffmpeg-static types are included in the package
|
||||
# server/ — PDF generation
|
||||
pnpm --filter @paperclipai/server add playwright-chromium
|
||||
# After install:
|
||||
npx playwright install chromium
|
||||
|
||||
# server/ — social images and OG cards (no-browser SVG→PNG pipeline)
|
||||
pnpm --filter @paperclipai/server add satori @resvg/resvg-js
|
||||
|
||||
# server/ — color/theme/palette with WCAG
|
||||
pnpm --filter @paperclipai/server add culori
|
||||
|
||||
# server/ — local image generation via ComfyUI (optional, graceful degradation)
|
||||
pnpm --filter @paperclipai/server add @stable-canvas/comfyui-client
|
||||
|
||||
# ui/ — culori for live preview (tree-shakeable ESM, safe to add to both)
|
||||
pnpm --filter @paperclipai/ui add culori
|
||||
```
|
||||
|
||||
---
|
||||
|
|
@ -226,14 +284,18 @@ pnpm --filter @paperclipai/server add ffmpeg-static grammy
|
|||
|
||||
| Avoid | Why | Use Instead |
|
||||
|-------|-----|-------------|
|
||||
| `fluent-ffmpeg` | Archived May 2025, no longer maintained | Direct `child_process.spawn` with `ffmpeg-static` binary |
|
||||
| `@ffmpeg-installer/ffmpeg` | Stale — last updated 2022, ships FFmpeg 4.x | `ffmpeg-static ^5.2.0` (ships FFmpeg 6.1.1, arm64 support) |
|
||||
| `telegraf` | TypeScript type system described as "too complex to understand" in grammY's comparison docs; lower weekly downloads than grammY | `grammy ^1.41.1` |
|
||||
| `node-telegram-bot-api` | Low-level, requires callback polling setup, no middleware, more boilerplate | `grammy` |
|
||||
| `@ricky0123/vad-node` | Node.js support was discontinued by the maintainer | `@ricky0123/vad-react` (browser-only, which is where recording lives) |
|
||||
| `whisper.js` / `transformers.js` (browser WASM) | 200MB+ model download in browser; slow on first load; server-side Whisper via `smart-whisper` is already in place | `smart-whisper` on server (already in v1.5 stack) |
|
||||
| `@mintplex-labs/piper-tts-web` for server TTS | Browser WASM only, no Node.js support | Piper binary via `child_process.spawn` (already specified in v1.5) |
|
||||
| Wake word / real-time streaming audio | Out of scope per PROJECT.md | Future milestone |
|
||||
| `puppeteer-core` | Slower than Playwright in 2026 benchmarks (147ms vs Playwright's 42ms cold on macOS arm64); Playwright has overtaken it | `playwright-chromium ^1.50.0` |
|
||||
| `@remotion/lambda` | Serverless — Nexus is Mac Mini local-only, no AWS account needed | `@remotion/renderer` local rendering |
|
||||
| `headless-mermaid` | Unmaintained since 2022 | `@mermaid-js/mermaid-cli ^11.12.0` (official) |
|
||||
| `chroma-js` | Limited OKLCH support; culori more accurate for WCAG; community consensus favors culori for design-system work in 2026 | `culori ^4.0.2` |
|
||||
| `tinycolor2` | Minimal WCAG support, no OKLCH, not suited for palette/design-system work | `culori ^4.0.2` |
|
||||
| `canvas` (`node-canvas`) | C++ bindings, complicated install, replaced by `sharp` + `resvg-js` for server-side image ops | `sharp` (already installed) + `resvg-js` |
|
||||
| `jimp` | Pure JS image processing, slow for production image generation workloads | `sharp ^0.34.5` (already installed, libvips-backed) |
|
||||
| `@ffmpeg/ffmpeg` (WASM) | WASM FFmpeg is 10× slower than native binary; intended for browser/serverless | `ffmpeg-static ^5.2.0` (already installed) |
|
||||
| `svg.js` / `svg-builder` | Browser authoring libraries; agents generate SVG strings directly | Agent-generated SVG + `sharp`/`resvg-js` for rendering |
|
||||
| `@vercel/og` | Wraps satori+resvg but adds Vercel Edge Runtime constraints; unnecessary wrapper | `satori` + `resvg-js` directly |
|
||||
| `pdf-lib` | Pure JS PDF creation — no HTML rendering. Correct for form-filling or programmatic PDF assembly, wrong for agent-generated HTML reports | `playwright-chromium` (HTML→PDF) |
|
||||
| ComfyUI with cloud API (Replicate, etc.) | Nexus is air-gap friendly; Mikkel has local GPU (Mac M4) | `@stable-canvas/comfyui-client` pointing to `localhost:8188` |
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -241,10 +303,12 @@ pnpm --filter @paperclipai/server add ffmpeg-static grammy
|
|||
|
||||
| Recommended | Alternative | When to Use Alternative |
|
||||
|-------------|-------------|-------------------------|
|
||||
| `grammy ^1.41.1` | `telegraf ^4.x` | If you need a battle-tested library with larger plugin ecosystem and tolerate complex TypeScript types |
|
||||
| `ffmpeg-static` + `spawn` | `@ffmpeg/ffmpeg` (WASM) | If running in a serverless/edge environment where native binaries are not available — not applicable here |
|
||||
| `@ricky0123/vad-react` | Manual `AudioWorklet` energy threshold | If you need lower latency or don't want the 5MB ONNX WASM payload; simpler but less accurate silence detection |
|
||||
| `@ricky0123/vad-react` | `MediaRecorder` with manual stop button (current impl) | The current v1.3 VoiceRecordButton works; VAD is strictly a UX upgrade |
|
||||
| `@remotion/renderer` (local) | `@remotion/lambda` | If you need parallel cloud rendering at scale — N/A for single-user Mac Mini |
|
||||
| `playwright-chromium` | `puppeteer-core` | If project already has Puppeteer deeply integrated; for new work prefer Playwright |
|
||||
| `@mermaid-js/mermaid-cli` (Node API) | `child_process.spawn("mmdc")` | If you need a fully isolated subprocess; Node API is cleaner for the server pattern |
|
||||
| `culori` | `chroma-js ^3.2.0` | If project requires many color interpolation methods (chroma has a richer palette generation API); for WCAG-first design systems culori wins |
|
||||
| `satori` + `resvg-js` | `playwright-chromium` for OG images | Playwright is simpler but adds another headless browser instance; satori is ~100× faster for simple card layouts with no JS |
|
||||
| `@stable-canvas/comfyui-client` | Direct `fetch` to ComfyUI REST API | If ComfyUI API changes and client lags; raw fetch is always viable since ComfyUI API is stable JSON |
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -252,69 +316,85 @@ pnpm --filter @paperclipai/server add ffmpeg-static grammy
|
|||
|
||||
| Package | Compatible With | Notes |
|
||||
|---------|-----------------|-------|
|
||||
| `grammy ^1.41.1` | Node.js >=18, TypeScript >=5 | Ships its own types, no `@types/grammy` |
|
||||
| `ffmpeg-static ^5.2.0` | Node.js >=14, macOS arm64 | Downloads correct binary at `npm install` time via `optionalDependencies` |
|
||||
| `@ricky0123/vad-react ^0.0.36` | React 19, Vite 6 | React 19 peer dep fixed in August 2025; requires SharedArrayBuffer (COOP/COEP headers) for ONNX thread worker |
|
||||
| `smart-whisper ^0.8.1` | Node.js >=18, macOS arm64 | From v1.5 — verify it's actually installed before v1.6 starts |
|
||||
| `remotion ^4.0.443` | Node.js >=18, React >=18, TypeScript >=5 | All `@remotion/*` packages must be pinned to the same version; `ffmpeg-static` already installed is detected automatically |
|
||||
| `@mermaid-js/mermaid-cli ^11.12.0` | Node.js >=18, puppeteer (bundled) | Installs its own bundled Chromium (~300MB); version must match `mermaid ^11.12.0` already in `ui/` |
|
||||
| `playwright-chromium ^1.50.0` | Node.js >=18, macOS arm64 | Separate Chromium binary from Remotion and mermaid-cli. Requires `npx playwright install chromium` post-install |
|
||||
| `satori ^0.26.0` | Node.js >=16, ESM | Supports TTF/OTF/WOFF fonts only — WOFF2 is NOT supported. Subset fonts for performance |
|
||||
| `resvg-js ^2.6.2` | Node.js >=14, macOS arm64 | Rust napi-rs binary; downloads macOS arm64 binary at install time |
|
||||
| `culori ^4.0.2` | Node.js >=14, ESM + CJS | Dual package; safe in both `server/` (CJS/ESM mix) and `ui/` (Vite ESM) |
|
||||
| `@stable-canvas/comfyui-client ^1.5.9` | Node.js >=16, zero deps | Requires ComfyUI running at localhost:8188; graceful degradation if not available |
|
||||
|
||||
**Critical COOP/COEP note for `@ricky0123/vad-react`:** The Silero VAD model runs in an ONNX Runtime Web worker that requires `SharedArrayBuffer`. This means the server must send these headers on HTML responses:
|
||||
**Chromium binary count warning:** v1.7 potentially installs THREE separate Chromium binaries:
|
||||
1. `remotion` — `~/.cache/puppeteer/chrome/` or Remotion's own cache
|
||||
2. `@mermaid-js/mermaid-cli` — puppeteer bundled binary
|
||||
3. `playwright-chromium` — `~/.cache/ms-playwright/`
|
||||
|
||||
```
|
||||
Cross-Origin-Opener-Policy: same-origin
|
||||
Cross-Origin-Embedder-Policy: require-corp
|
||||
```
|
||||
|
||||
This is a one-line addition to the Express static file middleware. Without it, VAD silently fails in Chrome/Firefox. The existing PWA service worker may also need `Cross-Origin-Embedder-Policy: require-corp` to avoid breaking.
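
A small middleware could set both headers. The sketch below uses structural types so it compiles without Express installed; `crossOriginIsolation` is a hypothetical name, and the function signature simply follows the usual `(req, res, next)` middleware shape.

```typescript
// Sketch: Express-compatible middleware that enables cross-origin isolation.
// Structural types keep the snippet independent of Express typings.
interface HeaderWritable {
  setHeader(name: string, value: string): unknown;
}

function crossOriginIsolation(
  _req: unknown,
  res: HeaderWritable,
  next: () => void,
): void {
  // Both headers are required before browsers expose SharedArrayBuffer.
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  next();
}
```

With Express this would be mounted ahead of the static file handler, for example `app.use(crossOriginIsolation)`.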
|
||||
Total disk: ~900MB. For the Mac Mini (256GB+ SSD) this is acceptable. During implementation, investigate whether `@mermaid-js/mermaid-cli` can be pointed at Playwright's Chromium via `PUPPETEER_EXECUTABLE_PATH` to reduce redundancy.
|
||||
|
||||
---
|
||||
|
||||
## Integration Architecture (v1.6 additions only)
|
||||
## Content Generation Package Architecture
|
||||
|
||||
```
|
||||
Browser (UI) Server (Express)
|
||||
───────────────────────────────── ───────────────────────────────────────
|
||||
|
||||
@ricky0123/vad-react POST /api/transcribe (existing)
|
||||
└── useMicVAD ─────→ └── ffmpeg-static: webm→wav16k
|
||||
└── onSpeechEnd(Float32Array) └── smart-whisper: wav→text
|
||||
└── userSpeaking (waveform UI) └── returns { text: string }
|
||||
|
||||
React ChatInput (updated) POST /api/voice/synthesize (new)
|
||||
└── voice mode toggle ─────→ └── Piper binary: text→wav
|
||||
└── auto-send on speech end └── returns audio/wav stream
|
||||
└── <audio> inline player ←──────────────┘
|
||||
|
||||
Telegram bridge (new, optional)
|
||||
└── grammy long polling
|
||||
└── message:text → relayToNexus()
|
||||
└── message:voice →
|
||||
ffmpeg: ogg→wav16k
|
||||
smart-whisper → text
|
||||
relayToNexus() → response
|
||||
Piper → wav
|
||||
ffmpeg: wav→ogg
|
||||
ctx.replyWithVoice()
|
||||
nexus/
|
||||
├── packages/
|
||||
│ └── content-renderer/ ← NEW workspace package
|
||||
│ ├── package.json (remotion, @remotion/bundler, @remotion/renderer)
|
||||
│ ├── src/
|
||||
│ │ ├── compositions/ (React TSX slide templates)
|
||||
│ │ └── index.ts (render() export)
|
||||
│ └── tsconfig.json
|
||||
│
|
||||
├── server/
|
||||
│ └── src/
|
||||
│ └── content/
|
||||
│ ├── diagram.ts (@mermaid-js/mermaid-cli)
|
||||
│ ├── pdf.ts (playwright-chromium)
|
||||
│ ├── social.ts (satori + resvg-js + sharp)
|
||||
│ ├── theme.ts (culori)
|
||||
│ ├── image.ts (@stable-canvas/comfyui-client)
|
||||
│ └── index.ts (route registrations)
|
||||
│
|
||||
└── ui/
|
||||
└── src/
|
||||
└── content/
|
||||
└── ThemePreview.tsx (culori for live WCAG preview)
|
||||
```
|
||||
|
||||
**Why isolate Remotion in its own package:**
|
||||
- Remotion uses its own webpack bundler (`@remotion/bundler`) that conflicts with Vite
|
||||
- Keeping it in `packages/content-renderer/` prevents build pipeline interference
|
||||
- The Express server imports only the render function, not the full webpack config
|
||||
|
||||
---
|
||||
|
||||
## Sources
|
||||
|
||||
- [grammY official docs](https://grammy.dev/) — TypeScript support, long polling, file handling confirmed
|
||||
- [grammY GitHub](https://github.com/grammyjs/grammY) — Bot API 9.6 badge, v1.41.1 version
|
||||
- [grammY file handling guide](https://grammy.dev/guide/files) — `ctx.getFile()`, download pattern
|
||||
- [grammY comparison with Telegraf](https://grammy.dev/resources/comparison) — TypeScript type quality comparison
|
||||
- [ffmpeg-static GitHub](https://github.com/eugeneware/ffmpeg-static) — macOS arm64 binary confirmed, FFmpeg 6.1.1
|
||||
- [fluent-ffmpeg archival](https://github.com/fluent-ffmpeg/node-fluent-ffmpeg) — archived May 22 2025, confirmed
|
||||
- [@ricky0123/vad-react npm](https://www.npmjs.com/package/@ricky0123/vad-react) — v0.0.36, last published 3 months ago
|
||||
- [vad React 19 support issue #188](https://github.com/ricky0123/vad/issues/188) — fixed August 28 2025, confirmed
|
||||
- [vad API docs](https://docs.vad.ricky0123.com/user-guide/api/) — `onSpeechEnd` Float32Array 16kHz confirmed
|
||||
- [Telegram Bot API sendVoice](https://core.telegram.org/bots/api#sendvoice) — OGG Opus format requirement
|
||||
- [nodejs-whisper GitHub](https://github.com/ChetanXpro/nodejs-whisper) — v0.2.9 comparison (rejected: subprocess-based, 10 months stale)
|
||||
- [Piper TTS GitHub releases](https://github.com/rhasspy/piper/releases) — macOS aarch64 binary availability
|
||||
- [Remotion official SSR docs](https://www.remotion.dev/docs/ssr) — `@remotion/renderer` Node.js API confirmed
|
||||
- [Remotion Express render-server template](https://github.com/remotion-dev/template-render-server) — Express integration pattern
|
||||
- [Remotion Chrome headless shell docs](https://www.remotion.dev/docs/miscellaneous/chrome-headless-shell) — Mac arm64 binary confirmed
|
||||
- [remotion npm](https://www.npmjs.com/package/remotion) — version 4.0.443 confirmed
|
||||
- [mermaid-cli GitHub](https://github.com/mermaid-js/mermaid-cli) — Node.js `run()` API, v11.12.0
|
||||
- [playwright-chromium npm](https://www.npmjs.com/package/playwright-chromium) — Chromium-only package
|
||||
- [PDF benchmark 2026](https://pdf4.dev/blog/html-to-pdf-benchmark-2026) — Playwright vs Puppeteer, macOS arm64, Node 22 (MEDIUM confidence — single benchmark source)
|
||||
- [vercel/satori GitHub](https://github.com/vercel/satori) — JSX→SVG, Node.js support, font format constraints
|
||||
- [satori npm](https://www.npmjs.com/package/satori) — version 0.26.0 confirmed
|
||||
- [thx/resvg-js GitHub](https://github.com/thx/resvg-js) — SVG→PNG via Rust napi-rs, macOS arm64 support
|
||||
- [resvg-js npm](https://www.npmjs.com/package/resvg-js) — version 2.6.2 (note: npm showed 0.1.97 for `resvg-js`; the canonical package may be `@resvg/resvg-js` — verify exact package name during implementation)
|
||||
- [culori npm](https://www.npmjs.com/package/culori) — version 4.0.2 confirmed
|
||||
- [culori vs chroma-js 2026](https://www.pkgpulse.com/blog/culori-vs-chroma-js-vs-tinycolor2-color-manipulation-javascript-2026) — OKLCH + WCAG accuracy comparison (MEDIUM confidence)
|
||||
- [@stable-canvas/comfyui-client npm](https://www.npmjs.com/package/@stable-canvas/comfyui-client) — version 1.5.9, MIT, zero deps
|
||||
- [Best local SD setup 2026](https://offlinecreator.com/blog/best-local-stable-diffusion-setup-2026) — ComfyUI Mac M4 support (MEDIUM confidence)
|
||||
|
||||
---
|
||||
|
||||
*Stack research for: Nexus v1.6 Voice Pipeline + Telegram Bridge*
|
||||
*Researched: 2026-04-03*
|
||||
*Supersedes: v1.5 STACK.md entries for smart-whisper and Piper — those remain valid; this file adds the glue and new libraries*
|
||||
**Unresolved — verify during implementation:**
|
||||
1. `resvg-js` package name: npm shows v0.1.97 for `resvg-js` but v2.6.2 for `@resvg/resvg-js`. Use `@resvg/resvg-js ^2.6.2` (the Rust napi-rs backed version).
|
||||
2. Chromium binary deduplication: test whether `@mermaid-js/mermaid-cli` respects `PUPPETEER_EXECUTABLE_PATH` pointing to Playwright's Chromium binary to save ~300MB.
|
||||
3. Remotion webpack vs Vite isolation: confirm that `packages/content-renderer/` with its own `tsconfig.json` and webpack bundler does not affect the root `pnpm build` pipeline.
|
||||
|
||||
---
|
||||
|
||||
*Stack research for: Nexus v1.7 Content Generation*
|
||||
*Researched: 2026-04-04*
|
||||
*Supersedes: v1.6 STACK.md entries remain valid — this file covers only v1.7 additions*
|
||||
|
|
|
|||
|
|
@ -1,185 +1,204 @@
|
|||
# Project Research Summary
|
||||
|
||||
**Project:** Nexus v1.6 — Voice Pipeline + Telegram Bridge
|
||||
**Domain:** Server-side STT/TTS voice pipeline with transport-agnostic service abstraction and a minimal Telegram relay bridge
|
||||
**Researched:** 2026-04-03
|
||||
**Project:** Nexus v1.7 — Content Generation Layer
|
||||
**Domain:** AI-driven local content generation (presentations, diagrams, PDFs, themes, social assets, icons)
|
||||
**Researched:** 2026-04-04
|
||||
**Confidence:** MEDIUM-HIGH
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
Nexus v1.6 adds two parallel capability tracks onto an existing React/Express/Paperclip monorepo: a transport-agnostic voice pipeline (Whisper STT + Piper TTS) and a minimal Telegram bridge that reuses those pipeline primitives for phone access. The established expert pattern for this class of system is a shared service abstraction (`voicePipelineService`) that both the web HTTP layer and the Telegram bot call directly — never duplicating STT/TTS logic across transports. The Telegram bridge must be a thin relay only, forwarding messages to the existing `chatService` and returning the response, with no separate bot personality, no rich UI elements, and no per-user conversation branching beyond the existing single-workspace model.
|
||||
Nexus v1.7 adds a local content generation layer to an existing Paperclip fork running on a Mac Mini M4. The scope is narrow but technically deep: agents produce visual and document deliverables (diagrams, PDFs, videos, color themes, social media assets, icons) entirely on-device, with no cloud API calls. The recommended approach is a pipeline of purpose-built libraries — Remotion for video, Playwright for PDFs, satori+resvg-js for social images, culori for OKLCH-based theme generation, and `@mermaid-js/mermaid-cli` for server-side diagrams — routed through a shared async job infrastructure built on top of the existing Paperclip `assets`, `publishLiveEvent`, and `StorageService` systems. Every content type is an installable skill, meaning the content layer is additive and does not touch the upstream Paperclip schema.
|
||||
|
||||
The recommended approach is to build `voicePipelineService` first as the keystone service (`transcribe`, `synthesize`, `formatForVoice`), then wire the web voice UI improvements on top of it, then attach the Telegram bridge as a consumer of the same service. Audio format conversion via `ffmpeg-static` (not the archived `fluent-ffmpeg`) handles the two required transcoding paths: browser WebM/Opus to WAV 16kHz for Whisper, and Telegram OGG/Opus to WAV 16kHz for Whisper. The `@ricky0123/vad-react` library handles browser-side voice activity detection. `grammy ^1.41.1` handles the Telegram bot layer with long polling (correct for a local Mac Mini deployment without a public HTTPS endpoint).
|
||||
The single most important architectural decision is the async job pattern. Render times vary widely (Remotion video: 3–10 min, PDF: 1–5 sec, Mermaid: fast), so every render request must return a job ID immediately and push progress via the existing SSE live-events bus. Synchronous HTTP for any render is the primary failure path. The second most important decision is Remotion bundle isolation: the webpack bundler must run once at startup in the dedicated `packages/content-renderer/` workspace package, never on each render request, and never inside the main Vite/tsc server build context.
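
The async job pattern can be sketched in memory as follows. `RenderJob`, `enqueueRender`, `getJob`, and the `publish` callback are hypothetical names; in Nexus the callback would presumably be backed by the existing live-events bus, whose exact signature is not shown in this document.

```typescript
import { randomUUID } from "node:crypto";

// Minimal in-memory sketch of "return a job ID now, report progress later".
type JobStatus = "queued" | "running" | "done" | "failed";

interface RenderJob {
  id: string;
  status: JobStatus;
  progress: number; // fraction in [0, 1]
  error?: string;
}

type Publish = (job: Readonly<RenderJob>) => void;
type RenderFn = (onProgress: (fraction: number) => void) => Promise<void>;

const jobs = new Map<string, RenderJob>();

function enqueueRender(render: RenderFn, publish: Publish): string {
  const job: RenderJob = { id: randomUUID(), status: "queued", progress: 0 };
  jobs.set(job.id, job);

  // Defer the work so the caller receives the ID before rendering starts.
  queueMicrotask(async () => {
    job.status = "running";
    publish({ ...job });
    try {
      await render((fraction) => {
        job.progress = Math.min(1, Math.max(0, fraction));
        publish({ ...job });
      });
      job.status = "done";
      job.progress = 1;
    } catch (err) {
      job.status = "failed";
      job.error = err instanceof Error ? err.message : String(err);
    }
    publish({ ...job });
  });

  return job.id;
}

function getJob(id: string): RenderJob | undefined {
  return jobs.get(id);
}
```

An HTTP handler would call `enqueueRender`, respond with the returned ID right away, and let clients follow progress over SSE or poll `getJob`. A production version would also need persistence and the per-type retention policies discussed elsewhere in this summary.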
|
||||
|
||||
The key risks are: (1) audio format mismatches causing silent transcription failures across browsers and the Telegram path, which require ffmpeg transcoding at every entry point; (2) the voice mode flag being stripped as it traverses the message pipeline layers, causing agents to respond with full markdown that TTS then renders as "asterisk asterisk important asterisk asterisk"; (3) Piper being invoked as a new process per request, causing 200–800ms model reload latency on every TTS response and silent truncation on responses over ~400 characters; and (4) browser autoplay policy blocking audio playback unless the `AudioContext` is unlocked during the user's initial "start voice mode" gesture.
|
||||
|
||||
---
|
||||
The primary risks cluster around three areas: Remotion's CPU/RAM footprint competing with Ollama on the shared M4 machine (mitigated by capping concurrency at 4 and serializing renders with LLM inference); security in the diagram and icon pipeline (Mermaid `securityLevel: "loose"` has documented XSS-to-RCE exploits; all SVG output, AI-generated or not, must pass DOMPurify before reaching the DOM); and storage growth (video renders accumulate fast on a finite Mac Mini SSD — `sourceTaskId` linking and per-type retention policies are mandatory from day one, not deferred cleanup).
|
||||
|
||||
## Key Findings
|
||||
|
||||
### Recommended Stack
|
||||
|
||||
v1.6 is additive to the v1.5 stack. The existing `smart-whisper`, `@mintplex-labs/piper-tts-web`, `multer`, and Express foundations remain unchanged. Three new libraries are required.
|
||||
The v1.7 stack is entirely additive to the v1.6 base (Express, sharp, ffmpeg-static, grammy, mermaid). Seven new library groups cover the new content types. Remotion requires workspace isolation in `packages/content-renderer/` due to its webpack bundler conflicting with Vite. Three separate Chromium binaries will be installed (Remotion, mermaid-cli, Playwright) totaling approximately 900MB on the Mac Mini SSD — acceptable, but worth attempting to share via `PUPPETEER_EXECUTABLE_PATH`.
|
||||
|
||||
One package name needs verification before installation: the correct package may be `@resvg/resvg-js` (v2.6.2, Rust napi-rs) rather than `resvg-js` (v0.1.97, older version). Confirm before `pnpm add`.
|
||||
|
||||
**Core technologies:**
|
||||
|
||||
- `@ricky0123/vad-react ^0.0.36` (ui/) — Browser-side Silero VAD via ONNX Runtime Web; delivers `Float32Array` at 16kHz on speech end; React 19 peer dep confirmed fixed August 2025; requires COOP/COEP headers for `SharedArrayBuffer`
|
||||
- `ffmpeg-static ^5.2.0` (server/) — Ships FFmpeg 6.1.1 binaries including macOS arm64; invoked via `child_process.spawn`; do NOT use the archived `fluent-ffmpeg` (archived May 2025) or stale `@ffmpeg-installer/ffmpeg` (FFmpeg 4.x)
|
||||
- `grammy ^1.41.1` (server/) — TypeScript-native Telegram bot framework (1.4M weekly downloads, higher than Telegraf); long polling for local deployment; clean file handling API via `ctx.getFile()`; Bot API 9.6 support confirmed
|
||||
|
||||
No new library is required for server-side Piper TTS (existing `child_process.spawn` pattern from v1.5) or audio playback (native `<audio>` element + Web Audio API).
|
||||
|
||||
**Critical compatibility note:** `@ricky0123/vad-react` requires COOP/COEP HTTP headers on HTML responses for `SharedArrayBuffer` support. Without them, VAD silently fails in Chrome and Firefox. One-line addition to Express static file middleware.
|
||||
- `remotion ^4.0.443` + `@remotion/bundler` + `@remotion/renderer`: React-based video/presentation rendering, Mac M4 arm64 confirmed, SSR API works in Node.js without browser UI — isolated in `packages/content-renderer/`
|
||||
- `playwright-chromium ^1.50.0`: HTML-to-PDF via headless Chromium, 42ms cold start vs Puppeteer's 147ms (2026 macOS arm64 benchmark), TypeScript-native — installed in `server/`
|
||||
- `@mermaid-js/mermaid-cli ^11.12.0`: Official server-side Mermaid-to-SVG via `run()` API, same version as `mermaid ^11.12.0` already in `ui/` — installed in `server/`
|
||||
- `satori ^0.26.0` + `@resvg/resvg-js ^2.6.2`: JSX/CSS-to-SVG-to-PNG without a browser; used by `@vercel/og` internally; pipeline for OG images, social cards, wallpapers — installed in `server/`
|
||||
- `culori ^4.0.2`: OKLCH-native color math, correct WCAG contrast calculation (0.04045 threshold, not the erroneous 0.03928 in the W3C spec), 2026 community consensus over chroma-js for design-system work — installed in `server/` and `ui/`
|
||||
- `@stable-canvas/comfyui-client ^1.5.9`: Zero-dependency MIT client for ComfyUI REST/WebSocket API; graceful degradation when ComfyUI not running on `localhost:8188` — optional, installed in `server/`
|
||||
- `sharp ^0.34.5` (already installed): image compositing, resizing, format conversion — extended for content use, not re-added
|
||||
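- `ffmpeg-static ^5.3.0` (already installed): stays in place for the existing audio transcoding paths; Remotion 4.x ships its own FFmpeg build, so the v3-era `ensureFfmpeg()` step is not needed and no second FFmpeg install is required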
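The WCAG contrast point in the `culori` entry above can be made concrete with a hand-rolled sketch. It is not culori's API; the `Rgb` type and function names are illustrative assumptions. It shows the sRGB linearization with the 0.04045 breakpoint, the relative-luminance weights, and the AA thresholds (4.5:1 for normal text, 3:1 for large text).

```typescript
type Rgb = { r: number; g: number; b: number }; // 0–255 per channel

// Linearize one sRGB channel using the 0.04045 breakpoint.
function linearize(channel8bit: number): number {
  const c = channel8bit / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance as a weighted sum of the linearized channels.
function relativeLuminance({ r, g, b }: Rgb): number {
  return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b);
}

// Contrast ratio between two colors, always >= 1.
function contrastRatio(a: Rgb, b: Rgb): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// WCAG AA: 4.5:1 for normal text, 3:1 for large text.
function meetsAA(fg: Rgb, bg: Rgb, largeText = false): boolean {
  return contrastRatio(fg, bg) >= (largeText ? 3 : 4.5);
}
```

In the theme engine, a check like `meetsAA` would gate each generated foreground/background pair before export.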
|
||||
|
||||
### Expected Features
|
||||
|
||||
**Must have (table stakes — v1.6 launch):**
|
||||
FEATURES.md establishes a clear three-tier priority (must have, should have, defer). The critical insight is that the Content Skill System must come first because every other content type depends on it. The satori + resvg-js + sharp pipeline is the single path for all 2D raster output; do not introduce per-type image libraries.
|
||||
|
||||
- Silence-based auto-submit via `@ricky0123/vad-react` — users expect this; manual stop feels archaic
|
||||
- Waveform/amplitude visualization while recording — without it users cannot confirm mic is active
|
||||
- Voice response auto-play with toggle — users expect playback to be automatic unless disabled
|
||||
- Markdown-free voice responses — spoken markdown sounds broken; dual output (prose + full markdown) is the correct solution
|
||||
- Telegram text relay with agent prefix — core use case for phone access; format: `[AgentName]: response`
|
||||
- Telegram voice note transcription — mobile Telegram users default to voice notes; ignoring them immediately frustrates
|
||||
**Must have (table stakes — P1):**
|
||||
- Download produced file with correct MIME type and `Content-Disposition: attachment`
|
||||
- Preview output before downloading (inline SVG, iframe PDF, Remotion Player, image thumbnail)
|
||||
- Generation status feedback via SSE progress: `queued → generating → ready → error`
|
||||
- Structured error recovery with actionable suggestions (e.g., "Run: ollama pull llava")
|
||||
- Save output to file system with git versioning and PLACEHOLDERS.md manifest integration
|
||||
- Re-generate with revised prompt (store parameters per job)
|
||||
- Content type labeled clearly (distinct icon, preview strategy, type registry)
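The status feedback item above (`queued → generating → ready → error`) could be modeled as a small transition table plus an SSE frame formatter. This is a sketch under assumptions: the event name `content.job.status` and the payload shape are hypothetical, and treating `error` as reachable from both non-terminal states is one reading of the arrow notation.

```typescript
type JobStatus = "queued" | "generating" | "ready" | "error";

// Allowed next states; "ready" and "error" are terminal in this sketch.
const allowedTransitions: Record<JobStatus, readonly JobStatus[]> = {
  queued: ["generating", "error"],
  generating: ["ready", "error"],
  ready: [],
  error: [],
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return allowedTransitions[from].includes(to);
}

// Formats one Server-Sent Events frame: an event line, a data line, a blank line.
function toSseFrame(jobId: string, status: JobStatus): string {
  return `event: content.job.status\ndata: ${JSON.stringify({ jobId, status })}\n\n`;
}
```

Re-generation would then create a new job from the stored parameters rather than moving a finished job back to `queued`.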
|
||||
|
||||
**Should have (differentiators, add after validation):**
|
||||
**Should have (differentiators — P1/P2):**
|
||||
- Agent-driven generation from chat (NL → skill routing → file attachment in chat)
|
||||
- Content types as installable skills (each generator is a separate skill file, not a monolithic feature)
|
||||
- PLACEHOLDERS.md manifest integration (draft flag, `prompt_hash`, `generated_at` on every asset)
|
||||
- Seed-color-to-full-theme pipeline with WCAG AA enforced (not optional) using OKLCH
|
||||
- Diagram from natural language (LLM → Mermaid syntax → server-side SVG)
|
||||
- Local-only operation (no data leaves Mac Mini)
|
||||
|
||||
- Telegram TTS reply option (OGG voice note reply back) — add after text relay is validated
|
||||
- Sentence-buffered TTS streaming — start playing sentence 1 while sentence 2 synthesizes; reduces perceived latency
|
||||
|
||||
**Defer (v2+):**
|
||||
|
||||
- Real-time speech-to-speech — requires full-duplex WebSocket audio + Pipecat/LiveKit; entirely different architecture
|
||||
- Wake word detection — always-on mic; hardware device concern
|
||||
- Deep Telegram web chat session sync — requires Postgres pub/sub event bus; explicitly deferred per PROJECT.md
|
||||
- Per-agent Telegram bots — maintenance nightmare; single bot + agent prefix is the correct approach
|
||||
**Defer to v2+:**
|
||||
- Branding media kit (high coordination cost; requires all other generators stable first)
|
||||
- Batch generation (job queue infrastructure not justified for v1.7)
|
||||
- Font embedding in PDF/video (licensing audit required)
|
||||
- Auto-publish to social platforms (OAuth token management, platform API complexity)
|
||||
- Template marketplace
|
||||
|
||||
### Architecture Approach
|
||||
|
||||
The architecture is built around a single server-side `voicePipelineService` that both HTTP voice routes and the Telegram relay call directly, with no HTTP round-trip within the same process. The existing `chatService` and `puterProxyService` are consumed directly by the Telegram bridge as TypeScript function calls. `nexus-settings.json` (not DB) stores `voiceMode` enum and `telegramToken`. No DB schema changes are required.
|
||||
The architecture builds entirely on existing Nexus/Paperclip patterns: factory functions (not classes), `StorageService` for all blob storage, `publishLiveEvent` for SSE fan-out, and the `assets` table for file metadata. The core addition is a `content_jobs` table tracking async render lifecycle, a `renderPipelineService` routing jobs to typed `RendererAdapter` implementations, and a `themeEngineService` as a pure computation service with no DB dependency. The ARCHITECTURE.md is derived from direct codebase inspection (HIGH confidence) — the patterns are proven.
|
||||
|
||||
Content types are exposed to agents as Markdown skill files, not as new agent-side code; the rendering itself lives in the server-side renderer adapters. Agents read the skill instructions and call `POST /api/companies/:id/content-jobs` with the appropriate `type` and `params`. No new schema is needed for the skill layer.
|
||||
|
||||
**Major components:**
|
||||
|
||||
1. `voicePipelineService` (`server/src/services/voice-pipeline.ts`) — Transport-agnostic STT/TTS core; `transcribe(buffer, format)`, `synthesize(text, voiceId?)`, `formatForVoice(text)` — the keystone abstraction for v1.6
|
||||
2. `telegram service` (`server/src/services/telegram.ts`) — grammY bot lifecycle + thin relay; calls `voicePipelineService` and `chatService` directly; long polling; one persistent `sessionId` per Telegram `chatId`
|
||||
3. `voice.ts` route (`server/src/routes/voice.ts`) — HTTP wrappers for `POST /api/transcribe` (moved from `chat-files.ts`) and new `POST /api/synthesize`; keeps `chat-files.ts` close to upstream for clean rebases
|
||||
4. UI voice components (`VoiceMicButton`, `WaveformDisplay`, `VoiceModeToggle`, `useVoiceMode`, `useSilenceDetection`) — all new; enhance existing `ChatInput` without replacing `VoiceRecordButton`
|
||||
5. `nexus-settings` schema extension — adds `voiceMode: "text" | "voice_input" | "full_voice"` and optional `telegramToken`; no DB migration needed
|
||||
|
||||
**Key patterns to follow:**
|
||||
|
||||
- Move `/transcribe` out of `chat-files.ts` into `voice.ts` to reduce upstream rebase conflict surface
|
||||
- Use `execFile` (not `exec`) for CLI subprocess calls — prevents shell injection, matches existing codebase pattern
|
||||
- Store Telegram token in `nexus-settings.json`, not in DB — DB migrations conflict on rebase
|
||||
- Long polling (`bot.start()`) not webhooks — Mac Mini is behind NAT with no public HTTPS endpoint
|
||||
- Wrap all CLI calls (`piper`, `ffmpeg`) in `Promise.race([call, timeout(8000)])` for graceful degradation
|
||||
1. `contentJobService` — Enqueues async render jobs, emits `content.job.started/done/failed` live events, tracks lifecycle in `content_jobs` table; returns `202 Accepted` with job ID immediately
|
||||
2. `renderPipelineService` — Strategy dispatch: routes `ContentJobType` to the correct `RendererAdapter`; each adapter is independently pluggable behind a shared interface
|
||||
3. `themeEngineService` — Pure OKLCH computation: seed color → palette → WCAG AA validation → CSS/JSON/Tailwind exports; synchronous HTTP, no DB, client-side preview via CSS custom property injection
|
||||
4. Renderer adapters (mermaid, svg, pdf, remotion, image) — each isolated behind `RendererAdapter` interface; binary-dependent adapters in `server/src/services/renderers/`
|
||||
5. `packages/content-renderer/` (Remotion workspace package) — Compositions bundled once at startup; `renderMedia()` called per request against cached bundle path
|
||||
6. UI components — `ContentJobViewer`, `DiagramRenderer`, `ThemePreview`, `ContentGallery` — consume SSE events and existing asset APIs
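The strategy dispatch in `renderPipelineService` might look roughly like the sketch below. The interface shape, the `RenderResult` type, and the toy `svgAdapter` are assumptions for illustration; only the names `RendererAdapter` and `ContentJobType` and the adapter list (mermaid, svg, pdf, remotion, image) come from the component list above.

```typescript
type ContentJobType = "mermaid" | "svg" | "pdf" | "remotion" | "image";

interface RenderResult {
  mimeType: string;
  bytes: Uint8Array;
}

interface RendererAdapter {
  readonly type: ContentJobType;
  render(params: Record<string, unknown>): Promise<RenderResult>;
}

// Factory function rather than a class, following the convention noted above.
function createRenderPipeline(adapters: RendererAdapter[]) {
  const byType = new Map<ContentJobType, RendererAdapter>();
  for (const adapter of adapters) byType.set(adapter.type, adapter);
  return {
    supports: (type: ContentJobType): boolean => byType.has(type),
    async render(type: ContentJobType, params: Record<string, unknown>): Promise<RenderResult> {
      const adapter = byType.get(type);
      if (!adapter) throw new Error(`No renderer registered for "${type}"`);
      return adapter.render(params);
    },
  };
}

// Toy adapter for illustration only: emits a fixed-size labeled placeholder.
const svgAdapter: RendererAdapter = {
  type: "svg",
  async render(params) {
    const label = String(params.label ?? "placeholder");
    const svg = `<svg xmlns="http://www.w3.org/2000/svg" width="64" height="64"><title>${label}</title></svg>`;
    return { mimeType: "image/svg+xml", bytes: new TextEncoder().encode(svg) };
  },
};
```

Because each adapter registers under its own type, heavier renderers can be added in later phases without touching the dispatch code.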
|
||||
|
||||
### Critical Pitfalls
|
||||
|
||||
1. **Audio format mismatch at every entry point** (Pitfall 27, 28) — Browser produces WebM/Opus. Telegram produces OGG/Opus 48kHz. Whisper requires WAV 16kHz mono. Always transcode via ffmpeg at every audio entry point with explicit `-ar 16000 -ac 1`. Make ffmpeg a hard startup dependency with absolute binary path, not PATH-resolved.
|
||||
The PITFALLS.md has 22 v1.7-specific pitfalls (45–66). The highest-severity items:
|
||||
|
||||
2. **Voice mode flag stripped in message pipeline** (Pitfall 32) — The `voiceMode: true` flag on messages must survive every pipeline layer (client → Express → message persistence → agent session codec → Hermes adapter system prompt). If stripped at any layer, the agent responds in full markdown and TTS synthesizes spoken symbols. Audit every layer before building dual output on top of it.
|
||||
1. **Remotion `bundle()` called per render request** (Pitfall 45) — Webpack takes 2–5 min; server becomes unresponsive under load. Prevention: call `bundle()` once at startup, cache the bundle path, pass only `inputProps` to `renderMedia()` per request.
|
||||
|
||||
3. **Piper process-per-request anti-pattern** (Pitfall 29) — Spawning a new `piper` process per TTS request reloads the ONNX model each time (200–800ms overhead). Long responses (>400 chars) silently truncate. Sentence-chunk text before synthesis. Implement warmup call at server startup. Use absolute binary paths for service-mode deployment.
|
||||
2. **Storage 10MB limit blocks video/large image storage** (Pitfall 48) — The existing `MAX_ATTACHMENT_BYTES = 10MB` and MIME type allowlist reject generated video files. Prevention: separate `MAX_GENERATED_ASSET_BYTES` constant and `generated/` namespace in `StorageService`; write rendered output directly via `putObject`, bypassing the upload route entirely.
|
||||
|
||||
4. **Browser autoplay policy blocking TTS playback** (Pitfall 40) — `audio.play()` is blocked unless triggered by a user gesture. The "start voice mode" button click must unlock an `AudioContext` (`ctx.resume()`); subsequent programmatic playback via `AudioBufferSourceNode` works without further gestures. Developers with autoplay whitelisted in dev browsers never see this failure.
|
||||
3. **Mermaid `securityLevel: "loose"` enabling XSS to RCE** (Pitfall 49) — AI-generated Mermaid syntax with `click` directives executes arbitrary JS. Confirmed exploits in production apps (OneUptime, DeepChat) in 2025–2026. Prevention: always `"strict"`, strip `%%{init}%%` and `click` statements before render, DOMPurify on SVG output.
|
||||
|
||||
5. **Telegram bot event loop blocking on voice pipeline** (Pitfall 37) — File download + ffmpeg transcode + Whisper transcription takes 2–5 seconds. If the handler awaits all of this synchronously, Telegram resends the update and the bot processes the same voice message multiple times. Acknowledge the update immediately, process async, send intermediate "Transcribing..." status to user.
|
||||
4. **HSL-based palette generation producing perceptually incoherent themes** (Pitfall 51) — Equal HSL lightness steps are not perceptually equal; blue at L=50% appears darker than yellow at L=50%. Prevention: use OKLCH via `culori` for all generation; never HSL as an intermediate.
|
||||
|
||||
6. **Piper/ffmpeg not found when running as system service** (Pitfall 38) — `spawn('piper', ...)` resolves via shell PATH in interactive terminals but not in `launchd`/`systemd` service environments. Store absolute binary paths in `nexus-settings` config; use them explicitly in every `spawn()` call.
|
||||
5. **Agent heartbeat timeout too short for long renders** (Pitfall 60) — A 3–10 min video render orphans when the heartbeat exits; task stays `in_progress` indefinitely, or a second render starts. Prevention: fire-and-forget from heartbeat (write job ID to task, exit); a polling routine checks job status and closes the task on completion.
|
||||
|
||||
---
|
||||
6. **Generated assets not linked to originating task** (Pitfall 66) — Orphaned files accumulate on Mac Mini SSD (50–200GB over months). Prevention: `sourceTaskId` is a mandatory field on every generated asset from day one; cleanup job triggers on task deletion.
|
||||
|
||||
7. **AI-generated SVG rendered inline without sanitization** (Pitfall 64) — XSS via `<script>` tags or event handlers in AI-generated SVG when set directly as innerHTML. Prevention: DOMPurify with SVG profile on all AI-generated SVG; prefer `<img src="data:image/svg+xml;base64,...">` over inline SVG for untrusted content.
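For the Mermaid pitfall above, the pre-render stripping step could be sketched as a plain string filter. This is an assumption-laden first line of defense, not a complete sanitizer: it removes `%%{...}%%` directives and lines that begin with a `click` statement, and it is meant to sit alongside `securityLevel: "strict"` and DOMPurify on the SVG output, not replace them. The function name is hypothetical.

```typescript
// Removes %%{...}%% directives (including init) and `click` interaction lines
// from untrusted Mermaid source before it is handed to the renderer.
function stripUnsafeMermaid(source: string): string {
  return source
    .replace(/%%\{[\s\S]*?\}%%/g, "")
    .split("\n")
    .filter((line) => !/^\s*click\s/i.test(line))
    .join("\n");
}

// Render config to pair with the filter; "strict" is the value the text calls for.
const safeMermaidConfig = { securityLevel: "strict" as const };
```

Node IDs that merely start with the word, such as `clickHandler`, are left alone because the filter requires whitespace after `click`.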
|
||||
|
||||
## Implications for Roadmap
|
||||
|
||||
Based on research, the component dependency graph strongly suggests a 4-phase structure:
|
||||
Based on the dependency graph in FEATURES.md and the build order in ARCHITECTURE.md, the work falls naturally into seven phases. The critical path runs: storage/job infrastructure → fast no-binary content types → UI pipeline → browser-dependent generators (PDF, video) → optional ML-dependent features.
|
||||
|
||||
### Phase 1: Voice Pipeline Foundation
|
||||
**Rationale:** `voicePipelineService` is the keystone — every other v1.6 feature calls it. Cannot build web voice UI improvements or the Telegram bridge without it. Schema extension for `voiceMode` also gates downstream work. Moving `/transcribe` to `voice.ts` reduces rebase friction before any other work begins.
|
||||
**Delivers:** `nexus-settings` schema with `voiceMode` + `telegramToken`; `voicePipelineService` with `transcribe`, `synthesize`, `formatForVoice`; `voice.ts` route with `/api/transcribe` (moved from `chat-files.ts`) and `/api/synthesize`; ffmpeg integration for WebM→WAV and OGG→WAV transcoding; `voiceMode` flag on `createMessageSchema` and `ChatMessage` shared type
|
||||
**Addresses:** Transport-agnostic pipeline (differentiator unlocking all features), voice mode flag storage (required by all consumers), server-side synthesize endpoint (required by Telegram bridge)
|
||||
**Avoids:** Pitfall 27 (audio format mismatch), Pitfall 32 (voice flag propagation path established before consumers built), Pitfall 38 (absolute binary paths baked in from the start), Pitfall 29 (sentence-chunked synthesis from the start)
|
||||
**Research flag:** Standard patterns — `execFile`, WAV format conversion, service abstraction are well-documented. Skip `/gsd:research-phase`.
|
||||
### Phase 1: Storage and Job Infrastructure
|
||||
**Rationale:** Everything else depends on this. The `content_jobs` table, `renderPipelineService` stub, storage namespace extension, and the 10MB limit fix (Pitfall 48) must exist before any content type can be built. The `sourceTaskId` field (Pitfall 66) must be present from the first asset stored.
|
||||
**Delivers:** `content_jobs` DB migration, `contentJobService`, `renderPipelineService` stub, extended storage namespace, `LIVE_EVENT_TYPES` for content jobs, API route scaffolding, `MAX_GENERATED_ASSET_BYTES` constant
|
||||
**Addresses:** Table stakes (download, status feedback, save to file system, re-generate)
|
||||
**Avoids:** Pitfall 48 (storage size limit), Pitfall 66 (orphaned assets), Pitfall 45 (bundle-per-render pre-empted by establishing async job model), Pitfall 60 (agent heartbeat — async fire-and-forget designed here)
|
||||
|
||||
### Phase 2: Web Chat Voice UI
|
||||
**Rationale:** UI improvements depend only on Phase 1 pipeline and are independent of Telegram. Establishes the voice UX foundation that users interact with directly. Validates the voice mode flag end-to-end before Telegram consumes the same flag.
|
||||
**Delivers:** `VoiceMicButton` with `@ricky0123/vad-react` silence detection; `WaveformDisplay` via AnalyserNode; `VoiceModeToggle` three-state control; `useVoiceMode` and `useSilenceDetection` hooks; `ChatMessage` dual output (voice badge + expandable full markdown); `TtsButton` auto-play prop; COOP/COEP headers on Express static middleware
|
||||
**Addresses:** Silence auto-submit (table stakes), waveform visualization (table stakes), auto-play toggle (table stakes), voice mode setting (table stakes), markdown-free voice responses (table stakes)
|
||||
**Avoids:** Pitfall 31 (VAD library vs. naive RMS threshold), Pitfall 40 (AudioContext unlocked on voice mode start button), Pitfall 35 (sanitizeForTTS utility exists before first TTS integration test)
|
||||
**Research flag:** `@ricky0123/vad-react` API is confirmed via docs; COOP/COEP header pattern is standard Express middleware. Skip `/gsd:research-phase`.
|
||||
### Phase 2: Fast Content Types (No Binary Dependencies)
|
||||
**Rationale:** SVG generation and theme engine are pure TypeScript with no Chromium, Webpack, or binary deps. They validate the end-to-end pipeline (job → render → asset → SSE → UI) at low risk before heavier renderers are added. WCAG contrast correctness (Pitfall 52) and OKLCH color space (Pitfall 51) must be locked here — retrofitting after the theme exporter is built is costly.
|
||||
**Delivers:** `svgGeneratorAdapter` (icons, placeholders, banners), `themeEngineService` (OKLCH, WCAG AA enforcement, CSS/JSON/Tailwind export), placeholder asset system with DRAFT watermark, culori integration
|
||||
**Addresses:** Theme + palette generator (P1), placeholder asset system (P1), icon generation scaffolding, OKLCH export in multiple formats
|
||||
**Avoids:** Pitfall 51 (HSL perceptual incoherence), Pitfall 52 (WCAG linearization error), Pitfall 62 (HEX-only export losing OKLCH)
|
||||
|
||||
### Phase 3: Telegram Bridge
|
||||
**Rationale:** Telegram bridge is a pure consumer of Phase 1's `voicePipelineService` and the existing `chatService`. No web UI changes needed. Must follow Phase 1 but is independent of Phase 2.
|
||||
**Delivers:** `telegramService` with grammY long polling; text relay to `chatService`; voice note relay (OGG download → ffmpeg transcode → transcribe → agent → text reply); persistent `chatId → sessionId` mapping; agent prefix on replies; `POST /api/telegram/token` and `GET /api/telegram/status` management routes
|
||||
**Addresses:** Telegram text relay (table stakes), Telegram voice note relay (table stakes), agent identity visible in Telegram replies (table stakes)
|
||||
**Avoids:** Pitfall 28 (OGG 48kHz → WAV 16kHz explicit transcode, not assumed), Pitfall 33 (persistent session per chatId, not per message), Pitfall 34 (long polling; delete any existing webhook first), Pitfall 37 (async pipeline; acknowledge immediately; send "Transcribing..." status)
|
||||
**Research flag:** Needs `/gsd:research-phase` for grammY session management (persistent `chatId → sessionId` mapping approach vs. grammY conversation plugin) and async update acknowledgement pattern before implementation.
|
||||
### Phase 3: Diagram Generation and Content Gallery UI
|
||||
**Rationale:** Mermaid is the highest-value, lowest-complexity content type. The UI pipeline (ContentJobViewer, DiagramRenderer, ContentGallery) validates the SSE progress flow end-to-end. The Mermaid security config (Pitfall 49) and DOMPurify memory pattern (Pitfall 50) must be established before any diagram renders reach the browser.
|
||||
**Delivers:** `mermaidRendererAdapter` (server-side via `@mermaid-js/mermaid-cli`), `ChatMarkdownMessage` extension for client-side Mermaid fences, `DiagramRenderer` component, `ThemePreview` component, `ContentJobViewer`, `ContentGallery`, `GeneratedAssetCard`, `assetService.list()`
|
||||
**Addresses:** Diagram generation (P1), content type preview, inline diagram rendering in chat
|
||||
**Avoids:** Pitfall 49 (Mermaid XSS/RCE), Pitfall 50 (DOMPurify JSDOM memory accumulation), Pitfall 59 (server-side Mermaid DOM requirement)
|
||||
|
||||
### Phase 4: Polish and Post-Launch Additions
|
||||
**Rationale:** After core voice and Telegram are validated, add differentiator features that require voice pipeline stability. These are explicitly post-validation based on user feedback triggers.
|
||||
**Delivers:** Telegram TTS reply (synthesize OGG voice note reply); sentence-buffered TTS streaming; Piper persistent warmup optimization; voice response history in chat UI
|
||||
**Addresses:** Sentence-buffered TTS (differentiator), Telegram TTS reply (differentiator)
|
||||
**Avoids:** Pitfall 39 (dual output via single LLM call, not two calls), Pitfall 29 (persistent Piper process architecture)
|
||||
**Research flag:** Flag for `/gsd:research-phase` on Piper persistent HTTP wrapper — community `piper-http` package status is unconfirmed; verify before committing to this approach.
|
||||
### Phase 4: Wallpapers and OG Images (Satori Pipeline)
|
||||
**Rationale:** The satori+resvg-js+sharp pipeline is pure Node.js (no Chromium) and covers OG images, social headers, and wallpapers in a single code path. Establishes the reusable 2D raster pipeline before PDF and video introduce heavier binary deps.
|
||||
**Delivers:** Platform-sized image outputs (OG 1200x630, Instagram 1080x1080, desktop wallpaper 2560x1440, etc.), `social.ts` service, platform dimension registry constant
|
||||
**Uses:** satori, @resvg/resvg-js, sharp (already installed)
|
||||
**Addresses:** Wallpapers + OG images (P1), social media content scaffolding (P2)
|
||||
**Avoids:** Pitfall 56 (platform MIME type and dimension constraints encoded as explicit data structure, not magic numbers)
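The platform dimension registry mentioned in this phase could be expressed as a typed constant. The three sizes below are the ones named in the deliverables; the key names, the `PlatformSize` shape, and the MIME type per entry are assumptions made for the sketch.

```typescript
type PlatformKey = "og" | "instagramSquare" | "desktopWallpaper";

interface PlatformSize {
  width: number;
  height: number;
  mimeType: "image/png" | "image/jpeg"; // assumed output format per platform
}

// Explicit data structure instead of magic numbers scattered through renderers.
const PLATFORM_SIZES: Record<PlatformKey, PlatformSize> = {
  og: { width: 1200, height: 630, mimeType: "image/png" },
  instagramSquare: { width: 1080, height: 1080, mimeType: "image/png" },
  desktopWallpaper: { width: 2560, height: 1440, mimeType: "image/png" },
};

// Width divided by height, useful when choosing a layout template.
function aspectRatio(key: PlatformKey): number {
  const { width, height } = PLATFORM_SIZES[key];
  return width / height;
}
```

New platforms are then added by extending `PlatformKey` and the registry, which the compiler enforces together.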
|
||||
|
||||
### Phase 5: PDF Document Generation
|
||||
**Rationale:** PDF introduces the first persistent Chromium instance via `playwright-chromium`. The browser lifecycle must be established as a long-lived instance (Pitfall 54) before any template work begins. Font self-hosting (Pitfall 53) must be designed before the first PDF template is considered complete.
|
||||
**Delivers:** `pdfRendererAdapter` (Playwright persistent browser instance), HTML template PDF (reports, one-pagers), pdf-lib for data-driven invoices, font self-hosting via Express static server, PDF download flow in UI
|
||||
**Addresses:** PDF generation (P1)
|
||||
**Avoids:** Pitfall 53 (headless Chromium font loading), Pitfall 54 (Puppeteer launch-per-request overhead)
|
||||
|
||||
### Phase 6: Video and Presentations (Remotion)
|
||||
**Rationale:** Remotion is the highest-complexity and highest-risk content type — webpack bundler conflicts, three Chromium binaries total, M4 concurrency limits, and the agent heartbeat timeout problem. It comes last among P1/P2 features so the async job infrastructure (Phase 1) is fully proven before the longest-running render type is added.
|
||||
**Delivers:** `packages/content-renderer/` workspace package, `remotionRendererAdapter` (CLI subprocess with cached bundle), video playback UI, `onProgress` SSE progress events, render queue with `concurrency: 4` on M4
|
||||
**Addresses:** Remotion presentations + video (P2)
|
||||
**Avoids:** Pitfall 45 (bundle-per-render), Pitfall 46 (Chromium concurrency thrashing), Pitfall 47 (bundler inside compiled server context), Pitfall 55 (video not streamable — onProgress mandatory), Pitfall 63 (pnpm lockfile conflicts — add Remotion immediately after upstream rebase)
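The bundle-once rule (Pitfall 45) amounts to memoizing a single bundling promise. The sketch below is library-agnostic: `createBundleCache` and the `Bundler` type are hypothetical names, and in the real adapter the injected function would presumably wrap `bundle()` from `@remotion/bundler`, with the resolved location passed to each render.

```typescript
type Bundler = () => Promise<string>; // resolves to the bundle location

// Wraps a slow bundler so it runs at most once; concurrent callers share the
// same in-flight promise instead of triggering parallel bundles.
function createBundleCache(bundleOnce: Bundler): () => Promise<string> {
  let cached: Promise<string> | undefined;
  return function getBundle(): Promise<string> {
    if (!cached) {
      cached = bundleOnce().catch((err: unknown) => {
        cached = undefined; // allow a retry after a failed bundle
        throw err;
      });
    }
    return cached;
  };
}
```

Calling `getBundle()` once during server startup warms the cache, so the first render request does not pay the bundling cost.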
|
||||
|
||||
### Phase 7: Content as Skills
|
||||
**Rationale:** No new code — this phase writes Markdown skill files for each content type in `company_skills`. It is last because skill instructions reference API contracts finalized in Phases 1–6. Plugin boundary rules (Pitfall 57) must be enforced before any skill implementation.
|
||||
**Delivers:** Skill markdown files for diagram, theme, PDF, wallpaper, video content types; agent-callable via existing Skill Aggregator
|
||||
**Addresses:** Content types as installable skills (differentiator)
|
||||
**Avoids:** Pitfall 57 (plugin workers bypassing JSON-RPC bridge, using direct HTTP to host API)
|
||||
|
||||
### Phase Ordering Rationale
|
||||
|
||||
- `voicePipelineService` (Phase 1) strictly precedes both Phase 2 and Phase 3 — this is the hardest dependency in the v1.6 graph
|
||||
- Phase 2 and Phase 3 are independent of each other and can run in parallel for two-developer teams; sequential ordering here assumes single-developer delivery
|
||||
- `voiceMode` schema change (Phase 1) must precede `ChatMessage` dual output (Phase 2) — shared package change gates UI work
|
||||
- Moving `/transcribe` from `chat-files.ts` to `voice.ts` in Phase 1 reduces rebase conflict surface before any other work begins
|
||||
- Phase 4 is explicitly post-validation — only add Telegram TTS reply and sentence-buffered streaming after confirming the basic pipeline is stable in real use
|
||||
- Phases 1 → 2 → 3 follow the build-order diagram in ARCHITECTURE.md exactly: infrastructure unblocks fast types, fast types validate the pipeline, UI comes after the first adapter works end-to-end.
|
||||
- Phase 4 (Satori) precedes Phase 5 (PDF) because Satori has no Chromium dep; PDF introduces the first persistent browser instance that the diagram renderer (Phase 3) can optionally reuse to avoid a second Chromium binary.
|
||||
- Phase 6 (Remotion) is last among feature phases because it is CPU/RAM-intensive and its Webpack bundler is a build pipeline risk — isolating it reduces rebase conflict surface.
|
||||
- Phase 7 (Skills) is last because skill instructions reference finalized API contracts.
|
||||
|
||||
---
|
||||
### Research Flags
|
||||
|
||||
Phases likely needing deeper research during planning:
|
||||
- **Phase 6 (Remotion):** Chromium binary count on the specific Mac Mini M4 config (the 16GB vs 32GB RAM variant changes the concurrency budget); Remotion bundle vs Vite isolation needs validation in the actual monorepo build pipeline; run `npx remotion benchmark` before finalizing the concurrency setting
|
||||
- **Phase 5 (PDF):** Verify whether `playwright-chromium` and `@mermaid-js/mermaid-cli` can share a Chromium binary via `PUPPETEER_EXECUTABLE_PATH` to reduce total to two binaries instead of three
|
||||
- **Phase 4 (Satori):** Verify correct package name: `@resvg/resvg-js` vs `resvg-js` — npm shows different versions; confirm before `pnpm add`
|
||||
|
||||
Phases with standard patterns (can proceed without additional research):
|
||||
- **Phase 1 (Infrastructure):** Factory function pattern, `content_jobs` table schema, and SSE live events pattern are all directly codebase-confirmed — HIGH confidence, no research needed
|
||||
- **Phase 2 (Theme/SVG):** culori OKLCH API is documented and confirmed; WCAG threshold fix is specific and well-understood
|
||||
- **Phase 3 (Mermaid):** Mermaid CLI Node.js `run()` API confirmed in README; security config is a one-line change with documented correct value
|
||||
- **Phase 7 (Skills):** Skill markdown format is already established in the codebase
|
||||
|
||||
## Confidence Assessment
|
||||
|
||||
| Area | Confidence | Notes |
|
||||
|------|------------|-------|
|
||||
| Stack | MEDIUM-HIGH | grammy HIGH (official docs, Bot API 9.6 verified); ffmpeg-static MEDIUM (arm64 confirmed, pipe approach verified); vad-react MEDIUM (React 19 fix confirmed via GitHub issue; ONNX WASM SharedArrayBuffer behavior requires COOP/COEP header testing) |
|
||||
| Features | MEDIUM-HIGH | STT/TTS pipeline patterns well-documented; dual output prompt engineering reliability is MEDIUM — smaller 7B models produce malformed structured output ~10% of the time; Approach B fallback (post-processing strip) must be implemented |
|
||||
| Architecture | HIGH | Based on direct codebase inspection of actual source files; service boundary and data flow verified; no speculative assumptions |
|
||||
| Pitfalls | HIGH | Based on direct codebase analysis plus targeted research on each integration domain; v1.6 pitfalls 27–40 are specific, sourced, and actionable |
|
||||
| Stack | MEDIUM-HIGH | Remotion HIGH (official SSR docs confirmed). Playwright PDF benchmark MEDIUM (single benchmark source, pdf4.dev March 2026). resvg-js package name LOW (npm shows two packages — verify). culori MEDIUM (version and WCAG claim confirmed via npm + pkgpulse comparison). ComfyUI client MEDIUM (npm confirmed, Mac M4 support sourced from offlinecreator.com). |
|
||||
| Features | MEDIUM-HIGH | Technology capabilities verified via docs. UX expectations inferred from Canva/Pitch/Mermaid Live comparisons. Skill architecture patterns based on existing Nexus skill system. |
|
||||
| Architecture | HIGH | Derived entirely from direct codebase inspection of `/opt/nexus/` on 2026-04-04. Factory function patterns, StorageService interface, live events bus, placeholder service, and asset service all confirmed by reading source files. |
|
||||
| Pitfalls | HIGH | Critical pitfalls verified via multiple sources: Mermaid XSS confirmed via production exploit reports (OneUptime, DeepChat 2025–2026); WCAG linearization error confirmed vs W3C spec; HSL perceptual non-uniformity confirmed by Tailwind CSS 4.0 rationale; Remotion bundle timing confirmed via official Remotion SSR docs. |
|
||||
|
||||
**Overall confidence:** MEDIUM-HIGH
|
||||
|
||||
### Gaps to Address
|
||||
|
||||
- **grammY session management approach:** Lightweight in-memory `Map<chatId, sessionId>` vs. grammY conversation plugin — not evaluated. Validate during Phase 3 research-phase before implementation.
|
||||
- **Dual output prompt reliability on 7B models:** Works reliably on larger models; ~90% on 7B tier. Approach B fallback (post-processing strip) must be implemented as a safety net, not treated as optional. Design both before Phase 1 ships.
|
||||
- **Piper persistent process viability:** Sentence-chunked per-request synthesis avoids the worst of the reload latency, but a persistent Piper HTTP wrapper would be cleaner long-term. Community `piper-http` status unconfirmed. Flag for Phase 4 research-phase.
|
||||
- **smart-whisper OGG support:** Whether `smart-whisper` can ingest OGG directly (avoiding ffmpeg for the Telegram path) or always requires WAV was not confirmed. Verify at Phase 1 start — if OGG is accepted natively, the Telegram transcription path can skip one transcode step.
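
The lightweight option in the session-management gap above could look roughly like the following. This is a sketch under stated assumptions, not the bridge's actual code: `ChatSessionStore`, `getOrCreate`, and `reset` are hypothetical names, and the session ID factory is injected because this document does not say how Nexus session IDs are created. It deliberately uses no grammY API.

```typescript
// Hypothetical in-memory store mapping a Telegram chat ID to a Nexus session ID.
// State is lost on process restart, which is the main trade-off of this approach.
class ChatSessionStore {
  private readonly sessions = new Map<number, string>();
  private readonly createSessionId: () => string;

  constructor(createSessionId: () => string) {
    // Assumption: the caller supplies whatever creates a server-side session.
    this.createSessionId = createSessionId;
  }

  // Return the existing session for this chat, or create and remember a new one.
  getOrCreate(chatId: number): string {
    let sessionId = this.sessions.get(chatId);
    if (sessionId === undefined) {
      sessionId = this.createSessionId();
      this.sessions.set(chatId, sessionId);
    }
    return sessionId;
  }

  // Forget the session for this chat; returns true if one existed.
  reset(chatId: number): boolean {
    return this.sessions.delete(chatId);
  }
}
```

A bot handler would call `getOrCreate` with the incoming chat ID on every message. Whether this is sufficient compared with the grammY conversation plugin is exactly the open question, so the sketch only shows how small the in-memory variant is.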
|
||||
|
||||
---
|
||||
- **resvg-js package name:** Run `npm info @resvg/resvg-js` before `pnpm add` — npm shows divergent versions between `resvg-js` (v0.1.97) and `@resvg/resvg-js` (v2.6.2). Use the scoped package.
|
||||
- **Chromium binary sharing:** Whether `PUPPETEER_EXECUTABLE_PATH` pointing to Playwright's Chromium satisfies `@mermaid-js/mermaid-cli`'s bundled-puppeteer binary requirement needs a 10-minute test on the Mac Mini before Phase 3 begins — could eliminate one ~300MB download.
|
||||
- **Remotion Vite isolation:** Run `pnpm build` after adding `packages/content-renderer/` to the workspace to verify no Vite/webpack conflicts surface before Phase 6 implementation work begins.
|
||||
- **ComfyUI availability:** Image generation (optional, Phase 7) assumes ComfyUI is already installed. Confirm whether this is in scope for v1.7 or should be deferred to v2; the install is multi-GB (ComfyUI + Flux.1 model).
|
||||
- **pdf-lib scope:** FEATURES.md recommends both Playwright (design-rich PDFs) and pdf-lib (invoices). During Phase 5 planning, confirm whether pdf-lib is in scope for v1.7 or whether all PDF generation is Playwright-only initially.
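
For the Chromium binary sharing gap above, the 10-minute test amounts to launching the Mermaid renderer with `PUPPETEER_EXECUTABLE_PATH` set to an existing Chromium binary. The sketch below only builds that environment. `withSharedChromium` is a hypothetical helper, the existence check is injected so the function stays dependency-free, and whether `@mermaid-js/mermaid-cli` actually honours the variable is the unverified part.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: return a copy of baseEnv that points Puppeteer at an
// existing Chromium binary. If the binary is missing, the copy is returned
// unchanged so the renderer falls back to its own bundled download.
function withSharedChromium(
  baseEnv: Env,
  chromiumPath: string,
  exists: (path: string) => boolean,
): Env {
  if (!exists(chromiumPath)) {
    return { ...baseEnv };
  }
  return { ...baseEnv, PUPPETEER_EXECUTABLE_PATH: chromiumPath };
}
```

In a Node.js process, `exists` would typically be `fs.existsSync`, and `chromiumPath` would come from wherever Playwright installed its browser. The returned object can then be passed as the `env` option when spawning the renderer.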
|
||||
|
||||
## Sources
|
||||
|
||||
### Primary (HIGH confidence)
|
||||
- [grammY official docs](https://grammy.dev/) — TypeScript support, long polling, file handling, Bot API 9.6 support
|
||||
- [grammY deployment types guide](https://grammy.dev/guide/deployment-types) — long polling vs. webhooks recommendation for local deployment
|
||||
- [ffmpeg-static GitHub](https://github.com/eugeneware/ffmpeg-static) — macOS arm64 binary confirmed, FFmpeg 6.1.1, pipe-based invocation pattern
|
||||
- [Telegram Bot API sendVoice](https://core.telegram.org/bots/api#sendvoice) — OGG Opus format requirement, 48kHz mono wire format
|
||||
- Direct codebase inspection: `server/src/routes/chat-files.ts`, `chat.ts`, `services/nexus-settings.ts`, `app.ts`, `ui/src/components/VoiceRecordButton.tsx`, `TtsButton.tsx`, `hooks/usePiperTts.ts`, `packages/shared/src/validators/chat.ts`, `packages/shared/src/types/chat.ts`
|
||||
- `.planning/STATE.md` — v1.6 architectural decisions (transport-agnostic, disposable bridge, dual output, per-message flag)
|
||||
- Direct codebase inspection of `/opt/nexus/` (2026-04-04) — service patterns, StorageService interface, live events bus, asset schema, placeholder service, package.json contents
|
||||
- [Remotion SSR docs](https://www.remotion.dev/docs/ssr) — `@remotion/renderer` Node.js API, bundle caching pattern
|
||||
- [Remotion Express render-server template](https://github.com/remotion-dev/template-render-server) — Express integration confirmed
|
||||
- [vercel/satori GitHub](https://github.com/vercel/satori) — JSX-to-SVG API, font format constraints (TTF/OTF/WOFF, no WOFF2)
|
||||
- [mermaid-cli GitHub](https://github.com/mermaid-js/mermaid-cli) — Node.js `run()` API confirmed
|
||||
- [playwright-chromium npm](https://www.npmjs.com/package/playwright-chromium) — Chromium-only package confirmed
|
||||
- [culori npm](https://www.npmjs.com/package/culori) — version 4.0.2, WCAG functions confirmed
|
||||
|
||||
### Secondary (MEDIUM confidence)
|
||||
- [@ricky0123/vad-react npm](https://www.npmjs.com/package/@ricky0123/vad-react) — v0.0.36, React 19 fix confirmed
|
||||
- [vad React 19 support issue #188](https://github.com/ricky0123/vad/issues/188) — React 19 peer dep fix confirmed August 2025
|
||||
- [vad API docs](https://docs.vad.ricky0123.com/user-guide/api/) — `onSpeechEnd` Float32Array 16kHz output confirmed
|
||||
- [fluent-ffmpeg archival](https://github.com/fluent-ffmpeg/node-fluent-ffmpeg) — archived May 22 2025, confirmed
|
||||
- [Real-Time vs Turn-Based STT/TTS Voice Agent Architecture (softcery.com)](https://softcery.com/lab/ai-voice-agents-real-time-vs-turn-based-tts-stt-architecture)
|
||||
- [The Voice AI Stack for Building Agents (assemblyai.com)](https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents)
|
||||
- [Telegram speech-to-text bot with Node.js (loonskai.com)](https://www.loonskai.com/blog/telegram-speech-to-text-bot-with-nodejs)
|
||||
- [grammY file handling guide](https://grammy.dev/guide/files) — `ctx.getFile()`, download pattern
|
||||
- [PDF benchmark 2026](https://pdf4.dev/blog/html-to-pdf-benchmark-2026) — Playwright vs Puppeteer macOS arm64 timing (single source)
|
||||
- [thx/resvg-js GitHub](https://github.com/thx/resvg-js) — SVG-to-PNG Rust napi-rs; package name ambiguity noted
|
||||
- [culori vs chroma-js 2026](https://www.pkgpulse.com/blog/culori-vs-chroma-js-vs-tinycolor2-color-manipulation-javascript-2026) — OKLCH accuracy comparison
|
||||
- [@stable-canvas/comfyui-client npm](https://www.npmjs.com/package/@stable-canvas/comfyui-client) — zero deps, MIT confirmed
|
||||
- [SocialSizes.io 2026](https://socialsizes.io/) — platform dimension registry
|
||||
|
||||
### Tertiary (LOW confidence — inferred from patterns)
|
||||
- Dual output prompt reliability on 7B models — inferred from structured output community reports; not benchmarked on Hermes specifically
|
||||
- Piper persistent HTTP wrapper — community pattern referenced; `piper-http` package status not verified
|
||||
- `sanitizeForTTS` utility pattern — inferred from TTS pipeline implementations; implementation detail not sourced from a canonical reference
|
||||
### Tertiary (LOW confidence — needs validation during implementation)
|
||||
- [offlinecreator.com — ComfyUI Mac M4 2026](https://offlinecreator.com/blog/best-local-stable-diffusion-setup-2026) — ComfyUI Metal/MPS support on M4
|
||||
- Mermaid XSS via `securityLevel: "loose"` — referenced via exploit reports for OneUptime and DeepChat; the attack vector is documented in the Mermaid changelog and security advisories; specific CVE numbers not cited
|
||||
|
||||
---
|
||||
|
||||
*Research completed: 2026-04-03*
|
||||
*Research completed: 2026-04-04*
|
||||
*Ready for roadmap: yes*
|
||||