From 0c290139316523800e48b740179f263d87f3c9ec Mon Sep 17 00:00:00 2001 From: Nexus Dev Date: Fri, 3 Apr 2026 23:53:14 +0000 Subject: [PATCH] docs: complete project research --- .planning/research/ARCHITECTURE.md | 833 ++++++++++++++--------------- .planning/research/FEATURES.md | 446 +++++++-------- .planning/research/PITFALLS.md | 505 ++++++++++++++++- .planning/research/STACK.md | 521 ++++++++---------- .planning/research/SUMMARY.md | 240 ++++----- 5 files changed, 1466 insertions(+), 1079 deletions(-) diff --git a/.planning/research/ARCHITECTURE.md b/.planning/research/ARCHITECTURE.md index 1fe48c50..08468d96 100644 --- a/.planning/research/ARCHITECTURE.md +++ b/.planning/research/ARCHITECTURE.md @@ -1,544 +1,507 @@ # Architecture Research -**Domain:** Smart Onboarding + Personal AI Assistant (v1.5) — integration with existing Nexus/Paperclip monorepo -**Researched:** 2026-04-02 +**Domain:** Voice Pipeline + Minimal Telegram Bridge (v1.6) — integration with existing Nexus/Paperclip monorepo +**Researched:** 2026-04-03 **Confidence:** HIGH — based on direct codebase inspection + verified current documentation --- ## System Overview -The v1.5 features layer on top of the existing monorepo without touching DB schema, API routes, or TypeScript identifiers. Every new service, component, and data flow hooks into the existing extension points: the adapter registry, the secrets service, the instance settings JSONB columns, the chat SSE pipeline, and the onboarding wizard overlay. +v1.6 adds two parallel capability tracks onto the existing monorepo: a transport-agnostic voice pipeline (Whisper STT + Piper TTS) and a disposable Telegram bridge that reuses those pipeline primitives for phone access. The architecture constraint is that no voice or chat logic is Telegram-specific — Telegram is an interchangeable transport layer that calls the same server services as the web UI. 
``` -┌──────────────────────────────────────────────────────────────────────────┐ -│ UI Layer (React/Vite) │ -│ │ -│ ┌─────────────────────────────────────────┐ ┌────────────────────────┐ │ -│ │ NexusOnboardingWizard (MODIFIED) │ │ PersonalAssistantPage │ │ -│ │ ┌──────────────┐ ┌──────────────────┐ │ │ (NEW — lazy loaded) │ │ -│ │ │ ModeSelector │ │ HardwareSummary │ │ │ ┌──────────────────┐ │ │ -│ │ │ (NEW) │ │ (NEW) │ │ │ │ AssistantChatHub │ │ │ -│ │ └──────────────┘ └──────────────────┘ │ │ │ (MODIFIED │ │ │ -│ │ ┌──────────────┐ ┌──────────────────┐ │ │ │ ChatPanel) │ │ │ -│ │ │ProviderSetup │ │ VoiceSetupStep │ │ │ └──────────────────┘ │ │ -│ │ │ (NEW) │ │ (NEW) │ │ └────────────────────────┘ │ -│ │ └──────────────┘ └──────────────────┘ │ │ -│ └─────────────────────────────────────────┘ │ -│ │ -│ ┌───────────────────────────────────────────────────────────────────┐ │ -│ │ Existing Extension Points │ │ -│ │ ChatPanel • ChatInput • useStreamingChat • ChatAgentSelector │ │ -│ └───────────────────────────────────────────────────────────────────┘ │ -└──────────────────────────────────────────────────────────────────────────┘ - ↕ REST + SSE -┌──────────────────────────────────────────────────────────────────────────┐ -│ Server Layer (Express) │ -│ │ -│ NEW routes mounted in app.ts: │ -│ ┌────────────────────┐ ┌───────────────────┐ ┌──────────────────────┐ │ -│ │ /api/hardware │ │ /api/puter-proxy │ │ /api/voice │ │ -│ │ (hardware detect) │ │ (Puter.js relay) │ │ (Whisper + Piper) │ │ -│ └────────────────────┘ └───────────────────┘ └──────────────────────┘ │ -│ ┌────────────────────┐ ┌───────────────────┐ │ -│ │ /api/memory │ │ Existing routes: │ │ -│ │ (assistant memory) │ │ /ollama • /chat │ │ -│ └────────────────────┘ │ /secrets • /llms │ │ -│ └───────────────────┘ │ -│ │ -│ NEW services (named-export pattern, no classes): │ -│ hardwareService • puterProxyService • voiceService • memoryService │ 
-└──────────────────────────────────────────────────────────────────────────┘ - ↕ Drizzle ORM -┌──────────────────────────────────────────────────────────────────────────┐ -│ Data Layer (PostgreSQL) │ -│ │ -│ NO new tables — all v1.5 state lives in existing extension columns: │ -│ ┌────────────────────────────────────────────────────────────────────┐ │ -│ │ instance_settings.general JSONB (onboarding mode, voice config) │ │ -│ │ company_secrets table (OAuth tokens, Puter token) │ │ -│ │ chat_conversations table (no change — re-used as-is) │ │ -│ └────────────────────────────────────────────────────────────────────┘ │ -│ │ -│ NEW file-based storage (server data dir, no migration needed): │ -│ ┌─────────────────────────────────────────────────────────────────────┐ │ -│ │ data/memory/.json (assistant memory store) │ │ -│ │ data/whisper-models/ (downloaded .bin files) │ │ -│ │ data/piper-voices/ (downloaded .onnx voice files) │ │ -│ └─────────────────────────────────────────────────────────────────────┘ │ -└──────────────────────────────────────────────────────────────────────────┘ - ↕ npx -┌──────────────────────────────────────────────────────────────────────────┐ -│ CLI Layer (Commander.js) │ -│ │ -│ NEW standalone package: packages/buildthis/ │ -│ ┌──────────────────────────────────────────────────────────────────────┐│ -│ │ npx buildthis → detects if Nexus running → opens browser ││ -│ │ OR runs nexus onboard wizard → starts server ││ -│ └──────────────────────────────────────────────────────────────────────┘│ -└──────────────────────────────────────────────────────────────────────────┘ ++-----------------------------------------------------------------------------------+ +| UI Layer (React/Vite) | +| | +| +-------------------------------------------------------------------------+ | +| | ChatPanel / PersonalAssistant (MODIFIED) | | +| | +---------------------+ +--------------------+ +------------------+ | | +| | | VoiceMicButton (NEW)| | WaveformDisplay | | 
TtsButton (v1.5) | | | +| | | silence detection | | (NEW) animated bars| | + auto-play prop | | | +| | | auto-send on silence| +--------------------+ +------------------+ | | +| | +---------------------+ | | +| | +-------------------------------------------------------------------+ | | +| | | ChatMessage (MODIFIED) — voice_mode badge, dual output toggle | | | +| | +-------------------------------------------------------------------+ | | +| | +-------------------------------------------------------------------+ | | +| | | VoiceModeToggle (NEW) — text only / voice input / full voice | | | +| | +-------------------------------------------------------------------+ | | +| +-------------------------------------------------------------------------+ | ++-----------------------------------------------------------------------------------+ + | HTTP + SSE ++-----------------------------------------------------------------------------------+ +| Server Layer (Express) | +| | +| +------------------------------------+ +------------------------------------+ | +| | voice.ts (NEW route) | | telegram.ts (NEW route/service) | | +| | POST /transcribe (MOVED) | | grammY long-poll process | | +| | POST /synthesize (NEW) | | text + voice relay | | +| +------------------------------------+ +------------------------------------+ | +| | | | +| +-----------------v--------------------------------------------v--------------+ | +| | voicePipelineService (NEW — core) | | +| | transcribe(audioBuffer, format) -> string | | +| | synthesize(text, voiceId?) -> Buffer (WAV) | | +| | formatForVoice(text) -> { voice: string, full: string } | | +| +------------------------------------------------------------------------------+ | +| | | +| +-----------------v--------------------------------------------------------------+| +| | chatService / nexusSettingsService (EXISTING) || +| | conversations . messages . stream SSE . memory . 
voiceEnabled || +| +--------------------------------------------------------------------------------+| +| | | +| +-----------------v--------------------------------------------------------------+| +| | External Processes (spawned via child_process.spawn / execFile) || +| | whisper-cpp / whisper (STT) piper (TTS) || +| +--------------------------------------------------------------------------------+| ++-----------------------------------------------------------------------------------+ + ^ + | Telegram Bot API (HTTPS long-poll) ++--------+------------------------------------------------------------------------+ +| Telegram (external service) | +| User sends text -> bot relays to chatService -> SSE reply -> bot sends back | +| User sends voice -> bot downloads OGG -> voicePipelineService.transcribe() | +| -> chatService -> reply -> voicePipelineService.synthesize() | +| -> bot sends OGG audio reply | ++----------------------------------------------------------------------------------+ ``` --- -## Component Responsibilities +## Integration Points: New vs. 
Existing -| Component | Responsibility | New or Modified | Where | -|-----------|----------------|-----------------|-------| -| `NexusOnboardingWizard` | Multi-step onboarding: mode, hardware, provider, voice, summary | MODIFIED (replace single-step) | `ui/src/components/NexusOnboardingWizard.tsx` | -| `ModeSelector` | Card picker: Personal AI / Project Builder / Both | NEW | `ui/src/components/onboarding/ModeSelector.tsx` | -| `HardwareSummaryStep` | Displays detected GPU/RAM/Unified Memory result | NEW | `ui/src/components/onboarding/HardwareSummaryStep.tsx` | -| `ProviderTierStep` | Puter.js auth button, OAuth tier, API key entry | NEW | `ui/src/components/onboarding/ProviderTierStep.tsx` | -| `VoiceSetupStep` | Whisper model picker + Piper voice picker | NEW | `ui/src/components/onboarding/VoiceSetupStep.tsx` | -| `OnboardingSummaryStep` | Final summary before launch | NEW | `ui/src/components/onboarding/OnboardingSummaryStep.tsx` | -| `PersonalAssistantPage` | Full-screen chat experience for assistant mode | NEW | `ui/src/pages/PersonalAssistant.tsx` | -| `AssistantMemoryBar` | Shows memory slots / recall indicator in chat | NEW | `ui/src/components/AssistantMemoryBar.tsx` | -| `hardwareService` | Reads `os.totalmem()`, runs `system_profiler` on macOS for GPU info | NEW | `server/src/services/hardware.ts` | -| `puterProxyService` | Wraps Puter.js Node.js client; relays AI calls through SSE | NEW | `server/src/services/puter-proxy.ts` | -| `voiceService` | Manages Whisper (via `whisper-node`) + Piper (via `@mintplex-labs/piper-tts-web` server-side) | NEW | `server/src/services/voice.ts` | -| `memoryService` | CRUD on file-based JSON memory store; injects context into system prompt | NEW | `server/src/services/memory.ts` | -| `hardwareRoutes` | `GET /api/hardware/info` | NEW | `server/src/routes/hardware.ts` | -| `puterProxyRoutes` | `POST /api/puter-proxy/chat` (SSE), `POST /api/puter-proxy/auth` | NEW | `server/src/routes/puter-proxy.ts` | -| `voiceRoutes` | 
`POST /api/voice/transcribe`, `POST /api/voice/speak`, `GET /api/voice/status` | NEW | `server/src/routes/voice.ts` | -| `memoryRoutes` | `GET/POST/DELETE /api/companies/:id/memory` | NEW | `server/src/routes/memory.ts` | -| `buildthis` package | `npx buildthis` entry point — detect/launch Nexus | NEW | `packages/buildthis/` | +### What Stays Unchanged + +| Component | Location | Status | +|-----------|----------|--------| +| `chatService` | `server/src/services/chat.ts` | No changes — voice pipeline uses it as-is | +| `nexusSettingsService` | `server/src/services/nexus-settings.ts` | Extend schema only (add `voiceMode`, `telegramToken`) | +| `chatFileRoutes` | `server/src/routes/chat-files.ts` | `/transcribe` moves out; file upload stays | +| `usePiperTts` | `ui/src/hooks/usePiperTts.ts` | No changes — TtsButton continues using browser WASM | +| `TtsButton` | `ui/src/components/TtsButton.tsx` | Add auto-play prop only | +| SSE stream endpoint | `server/src/routes/chat.ts` | No changes — Telegram bridge calls services directly | +| DB schema | `packages/db` | No changes — voice is file/process, not a DB column | + +### What Changes (MODIFIED) + +| Component | Location | Change | +|-----------|----------|--------| +| `VoiceRecordButton` | `ui/src/components/VoiceRecordButton.tsx` | Add silence detection, waveform data emission, auto-send on silence | +| `ChatInput` | `ui/src/components/ChatInput.tsx` | Wire new VoiceMicButton, add voice mode prop | +| `ChatMessage` | `ui/src/components/ChatMessage.tsx` | Show voice_mode badge, show dual output collapse/expand | +| `nexusSettingsSchema` | `server/src/services/nexus-settings.ts` | Add `voiceMode` enum and `telegramToken` optional string | +| `app.ts` | `server/src/app.ts` | Register `voiceRoutes`, `telegramRoutes` | +| `createMessageSchema` | `packages/shared/src/validators/chat.ts` | Add `voiceMode: z.boolean().optional()` flag on messages | +| `ChatMessage` type | `packages/shared/src/types/chat.ts` | Add 
`voiceMode: boolean | null` field | +| `chat-files.ts` | `server/src/routes/chat-files.ts` | Remove `/transcribe` handler (moved to voice.ts) | + +### What Is New (NEW) + +| Component | Location | Purpose | +|-----------|----------|---------| +| `voicePipelineService` | `server/src/services/voice-pipeline.ts` | Transport-agnostic STT/TTS core — used by web routes AND Telegram bridge | +| `voice.ts` (route) | `server/src/routes/voice.ts` | `POST /api/transcribe`, `POST /api/synthesize` — thin HTTP wrappers | +| `telegram.ts` (service) | `server/src/services/telegram.ts` | grammY bot init, long-poll loop, message relay, voice relay | +| `telegram.ts` (route) | `server/src/routes/telegram.ts` | `GET /api/telegram/status`, `POST /api/telegram/token` management endpoints | +| `VoiceMicButton` | `ui/src/components/VoiceMicButton.tsx` | Enhanced mic button with silence detection and waveform display | +| `WaveformDisplay` | `ui/src/components/WaveformDisplay.tsx` | Animated audio waveform bars using AnalyserNode | +| `VoiceModeToggle` | `ui/src/components/VoiceModeToggle.tsx` | Three-state toggle: text only / voice input / full voice | +| `useVoiceMode` | `ui/src/hooks/useVoiceMode.ts` | Reads/writes voice mode setting via `/api/nexus-settings` | +| `useSilenceDetection` | `ui/src/hooks/useSilenceDetection.ts` | Web Audio API AnalyserNode watching for 1.5s silence threshold | + +--- + +## Component Boundaries + +### voicePipelineService (Core) + +This is the key abstraction for v1.6. Both the web HTTP route and the Telegram bridge call this service — neither knows about the other. 
+ +| Method | Input | Output | Implementation | +|--------|-------|--------|----------------| +| `transcribe(buffer, format)` | `Buffer`, `"webm" or "ogg"` | `Promise<string>` | Writes temp file, uses `execFile` (not `exec`) to spawn `whisper-cpp` or `whisper` CLI, reads stdout, cleans up | +| `synthesize(text, voiceId?)` | `string`, optional voiceId | `Promise<Buffer>` | Spawns `piper` CLI via `spawn`, pipes text to stdin, collects WAV stdout | +| `formatForVoice(text)` | `string` | `{ voice: string; full: string }` | Strips code blocks and markdown for voice; returns both variants | + +The `transcribe` method extends the existing `/transcribe` implementation from `chat-files.ts` by adding an `ogg` format path alongside the existing `webm` path. The same cascade (whisper-cpp first, openai-whisper fallback) is preserved. + +**Why a dedicated service vs. inline in routes:** +The Telegram bridge cannot call the web route (circular HTTP call within the same process). Both transports need the same logic. Extracting to a service eliminates duplication and makes both implementations testable in isolation. + +### telegram service + +A thin relay, not a feature-rich bot. It: +1. Holds a single grammY `Bot` instance, initialized when `telegramToken` is set in nexus-settings +2. Routes text messages to `chatService.addMessage()` then collects AI response via `puterProxyService.chatStream()` +3. Routes voice messages — downloads OGG file, calls `voicePipelineService.transcribe()`, then same text path +4. If `voiceMode === "full_voice"`: calls `voicePipelineService.synthesize()`, sends audio back via `ctx.replyWithAudio()` +5. Prefixes agent name on replies: `[Agent Name]: message text` + +**No per-user conversation tracking.** All Telegram messages go to a single conversation (or create one on first use) associated with the workspace. This is the intentional "thin bridge" design — full sync is out of scope per PROJECT.md. + +### Voice Route vs. 
Chat Files Route + +The existing `/transcribe` endpoint lives inside `chatFileRoutes` in `chat-files.ts`. For v1.6, the endpoint moves to a dedicated `voice.ts` route. This is a path-preserving refactor: the endpoint behavior is unchanged, but the code now lives in a Nexus-specific file rather than inside a mostly-upstream file. + +Moving the handler reduces merge conflict surface on future upstream rebases of `chat-files.ts`. --- ## Recommended Project Structure ``` -packages/ -├── buildthis/ # NEW — npx buildthis entry point -│ ├── src/ -│ │ └── index.ts # bin entry: detect running Nexus, open browser or run onboard -│ └── package.json # name: "buildthis", bin: { buildthis: "./dist/index.js" } - server/src/ -├── services/ -│ ├── hardware.ts # NEW — detect GPU/RAM/Apple Silicon -│ ├── puter-proxy.ts # NEW — Puter.js Node.js client wrapper -│ ├── voice.ts # NEW — Whisper + Piper lifecycle -│ └── memory.ts # NEW — file-based JSON assistant memory -├── routes/ -│ ├── hardware.ts # NEW — GET /api/hardware/info -│ ├── puter-proxy.ts # NEW — POST /api/puter-proxy/chat (SSE) -│ ├── voice.ts # NEW — POST /api/voice/transcribe, /speak -│ └── memory.ts # NEW — GET/POST/DELETE /companies/:id/memory -└── app.ts # MODIFIED — mount 4 new route sets + app.ts # MODIFY: register voiceRoutes, telegramRoutes + routes/ + chat-files.ts # MODIFY: remove /transcribe handler (moved to voice.ts) + voice.ts # NEW: POST /transcribe, POST /synthesize + nexus-settings.ts # MODIFY: expose voiceMode + telegramToken fields + telegram.ts # NEW: GET /telegram/status, POST /telegram/token + services/ + voice-pipeline.ts # NEW: transcribe(), synthesize(), formatForVoice() + telegram.ts # NEW: grammY bot lifecycle + relay logic + nexus-settings.ts # MODIFY: add voiceMode + telegramToken to schema ui/src/ -├── components/ -│ ├── NexusOnboardingWizard.tsx # MODIFIED — multi-step replaces single-step -│ ├── AssistantMemoryBar.tsx # NEW -│ └── onboarding/ # NEW directory — onboarding step components -│ 
├── ModeSelector.tsx -│ ├── HardwareSummaryStep.tsx -│ ├── ProviderTierStep.tsx -│ ├── VoiceSetupStep.tsx -│ └── OnboardingSummaryStep.tsx -├── pages/ -│ └── PersonalAssistant.tsx # NEW — full-screen assistant page -├── hooks/ -│ ├── useHardwareInfo.ts # NEW — query /api/hardware/info -│ ├── usePuterChat.ts # NEW — SSE streaming from puter-proxy -│ ├── useVoiceInput.ts # NEW — Whisper transcription hook -│ ├── useVoiceSpeech.ts # NEW — Piper TTS hook -│ └── useAssistantMemory.ts # NEW — memory CRUD hook -└── api/ - ├── hardware.ts # NEW — typed fetch wrappers - ├── puter-proxy.ts # NEW - ├── voice.ts # NEW - └── memory.ts # NEW + components/ + VoiceMicButton.tsx # NEW: replaces VoiceRecordButton in ChatInput + WaveformDisplay.tsx # NEW: animated bars from AnalyserNode data + VoiceModeToggle.tsx # NEW: 3-state toggle (text / voice-in / full-voice) + VoiceRecordButton.tsx # KEEP as-is (still used in file upload contexts) + TtsButton.tsx # MODIFY: add autoPlay prop + ChatInput.tsx # MODIFY: add VoiceModeToggle, swap in VoiceMicButton + ChatMessage.tsx # MODIFY: voice_mode badge + dual output expand + hooks/ + useVoiceMode.ts # NEW: reads/writes voiceMode setting + useSilenceDetection.ts # NEW: AnalyserNode silence threshold + usePiperTts.ts # KEEP as-is (browser-side TTS unchanged) + +packages/shared/src/ + validators/chat.ts # MODIFY: add voiceMode flag to createMessageSchema + types/chat.ts # MODIFY: add voiceMode field to ChatMessage ``` -### Structure Rationale - -- **`packages/buildthis/`:** Standalone package with its own `package.json` and `bin` field — publishable to npm as `buildthis` independently. Does not depend on the monorepo server package at runtime; it only detects a running Nexus instance via HTTP or launches the CLI onboard flow. -- **`server/src/services/` additions:** All follow the existing named-export pattern (`export function hardwareService() { return { ... } }`). No classes. Dependencies injected as parameters. 
Drizzle `db` is only accepted if the service actually queries the DB. -- **`ui/src/components/onboarding/`:** Sub-directory isolates the 5 new step components from the main components directory. `NexusOnboardingWizard.tsx` imports them. This limits the upstream-conflict surface to the single wizard file. -- **`ui/src/pages/PersonalAssistant.tsx`:** New route registered in App.tsx routing (the only modification needed in the routing layer). The page re-uses `ChatPanel` with an `assistantMode` prop. - --- ## Architectural Patterns -### Pattern 1: Hardware Detection via Server-Side Shell Probe +### Pattern 1: Transport-Agnostic Voice Service -**What:** `hardwareService` runs on the Express server where it has access to `os.totalmem()` and can shell out to `system_profiler SPDisplaysDataType` on macOS to get GPU details. Apple Silicon unified memory is detected by checking the `cpu_brand_string` for "Apple M". Results are cached in memory (5-minute TTL) so the onboarding wizard can poll cheaply. +**What:** A server service (`voicePipelineService`) owns STT and TTS logic. HTTP routes and Telegram relay both call the service — neither implements STT/TTS directly. -**When to use:** Any time the onboarding wizard needs to display hardware capabilities to make model recommendations. +**When to use:** Any time two transports (web + bot) need the same capability. -**Trade-offs:** Server-side only — the UI cannot do this itself in the browser. The route is scoped to `assertBoard` (existing auth middleware), so it's protected. Apple Silicon reports unified memory as both RAM and VRAM; the service returns `{ unifiedMemory: true, totalBytes }` instead of separate fields. +**Trade-offs:** Adds one indirection layer. Worth it: eliminates duplication, makes each transport testable independently. 
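As a rough illustration of the `formatForVoice(text)` method that `voicePipelineService` exposes (described in this patch as stripping code blocks and markdown for voice while returning both variants), a minimal sketch might look like the following. The regular expressions, the `FENCE` constant, and the "(code omitted)" placeholder are assumptions made for illustration; the actual implementation is not shown in this patch.

```typescript
// Hypothetical sketch only: the patterns below are assumptions, not the project's code.
// Built with repeat() so the source never contains a literal triple backtick.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "[\\s\\S]*?" + FENCE, "g");

function formatForVoice(text: string): { voice: string; full: string } {
  const voice = text
    // Replace fenced code blocks with a short spoken placeholder.
    .replace(FENCED_BLOCK, " (code omitted) ")
    // Unwrap inline code spans, keeping their text.
    .replace(/`([^`\n]*)`/g, "$1")
    // Keep only the label of markdown links: [label](url) becomes label.
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1")
    // Drop leading heading markers and list bullets.
    .replace(/^[ \t]{0,3}(?:#{1,6}|[-*+])[ \t]+/gm, "")
    // Unwrap *italic* and **bold** emphasis.
    .replace(/\*{1,2}([^*\n]+)\*{1,2}/g, "$1")
    // Collapse runs of whitespace so the TTS engine receives one flowing line.
    .replace(/\s+/g, " ")
    .trim();

  // `full` is returned untouched so the UI can still render the original markdown.
  return { voice, full: text };
}
```

Keeping `full` unmodified matches the dual-output idea in this document: the spoken variant is lossy by design, while the text variant stays available for display.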
-**Example:** +**Shape:** ```typescript -// server/src/services/hardware.ts -export function hardwareService() { - let cache: HardwareInfo | null = null; - let cacheExpiry = 0; - - return { - async detect(): Promise<HardwareInfo> { - if (cache && Date.now() < cacheExpiry) return cache; - const totalBytes = os.totalmem(); - const gpuInfo = await probeGpu(); // shells system_profiler on macOS, /proc/driver/nvidia on Linux - cache = { totalBytes, gpu: gpuInfo, platform: process.platform }; - cacheExpiry = Date.now() + 5 * 60 * 1000; - return cache; - } - }; +// server/src/services/voice-pipeline.ts +export function voicePipelineService() { + // Uses execFile (not exec) — prevents shell injection, consistent with codebase pattern + async function transcribe(buffer: Buffer, format: "webm" | "ogg"): Promise<string>; + async function synthesize(text: string, voiceId?: string): Promise<Buffer>; + function formatForVoice(text: string): { voice: string; full: string }; + return { transcribe, synthesize, formatForVoice }; } ``` -### Pattern 2: Puter.js as a Server-Side Adapter (not browser-direct) +The existing `/transcribe` handler in `chat-files.ts` already uses `promisify(execFile)` — this pattern is the right model. The service wraps it with format selection (`webm` vs `ogg`) and the same whisper-cpp → openai-whisper cascade. -**What:** Puter.js supports Node.js via `@heyputer/puter.js` with `init(authToken)`. The server acts as a proxy: it holds the Puter auth token (stored in `company_secrets` via the existing `secretService`), forwards chat requests to `puter.ai.chat({ stream: true })`, and pipes the async iterable back to the browser as SSE — exactly the same format the existing `useStreamingChat` hook already consumes. +### Pattern 2: Thin Telegram Relay -**Why not browser-direct:** The existing chat architecture is server-mediated (all agent messages go through Express SSE). Bypassing this would require forking the streaming infrastructure. 
Using the server as proxy re-uses `useStreamingChat` unchanged and keeps the Puter token off the client. +**What:** The Telegram bot is a relay, not a first-class UI. It translates Telegram message events into the same chatService calls the web UI makes, then sends the response back via Telegram. -**When to use:** During onboarding when user selects "Puter.js cloud" tier and authenticates. The Puter auth flow opens a browser popup (`puter.auth.signIn()` must be user-initiated from the UI), receives a token, then POSTs it to `/api/puter-proxy/auth` for server storage. +**When to use:** Building a disposable bridge that will be replaced by a richer implementation later. -**Trade-offs:** One extra round-trip compared to browser-direct, but avoids token exposure and re-uses the existing SSE pipeline. Puter.js Node.js usage requires `@heyputer/puter.js` as a server dependency (not currently in the monorepo). +**Trade-offs:** No rich UI (no inline keyboards, no threading). Acceptable: PROJECT.md explicitly calls out "thin bridge only" and "Telegram threads/topics/inline keyboards" are out of scope. 
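To make the relay idea concrete, here is a minimal, transport-free sketch of the text path: store the incoming message, collect the streamed reply, and prefix the agent name as `[Agent Name]: message text`. The `RelayDeps` interface, its method signatures, and the `relayText` helper are hypothetical stand-ins; the real `chatService` and `puterProxyService` signatures are not shown in this patch, and no grammY calls are used here.

```typescript
// Hypothetical dependency shape: names and signatures are assumptions for illustration.
interface RelayDeps {
  agentName: string;
  // Stand-in for chatService.addMessage().
  addMessage(role: "user" | "assistant", content: string): Promise<void>;
  // Stand-in for puterProxyService.chatStream(), modelled as an async stream of text chunks.
  chatStream(prompt: string): AsyncIterable<string>;
}

async function relayText(deps: RelayDeps, incoming: string): Promise<string> {
  // 1. Record the user's message in the conversation.
  await deps.addMessage("user", incoming);

  // 2. Collect the streamed AI response into a single string.
  let reply = "";
  for await (const chunk of deps.chatStream(incoming)) {
    reply += chunk;
  }

  // 3. Prefix the agent name, as the bridge does on replies.
  return `[${deps.agentName}]: ${reply}`;
}
```

Because the helper only depends on the injected `deps`, the same function could be exercised from a test with fakes or from a bot handler with the real services, which is the point of keeping the bridge thin.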
-**Example (server-side relay):** +**Shape:** ```typescript -// server/src/routes/puter-proxy.ts -router.post("/api/puter-proxy/chat", async (req, res) => { - const token = await svc.getStoredToken(companyId); - const puter = init(token); - res.setHeader("Content-Type", "text/event-stream"); - const stream = await puter.ai.chat(req.body.messages, { stream: true }); - for await (const chunk of stream) { - res.write(`data: ${JSON.stringify(chunk)}\n\n`); - } - res.end(); +// server/src/services/telegram.ts +import { Bot } from "grammy"; + +export function telegramService(db: Db) { + let bot: Bot | null = null; + + function start(token: string): void; // idempotent, long-poll + function stop(): void; + function isRunning(): boolean; + + return { start, stop, isRunning }; +} +``` + +The bot calls `chatService(db)` and `puterProxyService(db)` directly — no HTTP round-trip to the same server. + +### Pattern 3: Voice Mode Flag on Messages + +**What:** Each message carries an optional `voiceMode: boolean` flag. When `true`, the server formats the response for voice (dual output: `voice` + `full`), and the client auto-plays TTS and shows the full text in a collapsible block. + +**When to use:** Differentiating voice-initiated messages from text messages within the same conversation. + +**Trade-offs:** Adds a field to `createMessageSchema` and the `ChatMessage` type. The field is optional and defaults to `false`, so existing messages and the upstream schema are not broken. 
+ +**Schema change:** +```typescript +// packages/shared/src/validators/chat.ts — additive only +export const createMessageSchema = z.object({ + role: z.enum(["user", "assistant", "system"]), + content: z.string().min(1).max(100_000), + agentId: z.string().uuid().optional(), + messageType: z.string().optional(), + voiceMode: z.boolean().optional(), // NEW in v1.6 }); ``` -### Pattern 3: Whisper on Server, Piper in Browser (Hybrid Voice) +### Pattern 4: Direct Service Calls in Telegram Bridge -**What:** Voice input (speech-to-text) runs server-side via `whisper-node` (Node.js bindings for whisper.cpp). The UI records audio via `MediaRecorder`, POSTs a blob to `POST /api/voice/transcribe`, and gets back a transcript string. Voice output (text-to-speech) uses `@mintplex-labs/piper-tts-web` which runs client-side via WebAssembly — no server round-trip needed for TTS. +**What:** The Telegram bot does not call the Express HTTP API to get AI responses. It calls `chatService(db)` and `puterProxyService(db)` as regular TypeScript function calls within the same server process. -**Why this split:** whisper.cpp requires native binaries that work on CPU-only hardware, which the server controls. Piper TTS web runs via WASM in the browser and has no native dependency — this keeps TTS latency low (no network round-trip) and works even if the server is slow. +**When to use:** Any time a server-side integration needs the same AI response capability as the web UI without an HTTP round-trip. -**When to use:** When user selects "voice mode" in onboarding (VoiceSetupStep). Whisper runs only if the user chooses a local Whisper model (downloaded to `data/whisper-models/`); as a fallback, the browser's native `webkitSpeechRecognition` / `SpeechRecognition` API is used. +**Trade-offs:** Telegram handler and web handler share the same in-process service instances. If chatService has connection pooling issues, both paths are affected. 
This is acceptable — single-user deployment, same DB connection pool. -**Trade-offs:** Whisper download adds 75MB–1.5GB to first-run setup. For CPU-only hardware, whisper-tiny.en (75MB) transcribes in ~2s for a 10s clip on M4 — acceptable. Piper WASM download is ~20MB (models ~30-100MB each). +**Why not HTTP:** A `fetch("http://localhost:PORT/api/...")` call from within the same server requires auth token injection, port discovery, and creates circular request chains that are hard to test and fragile in development. -**Example (voice input hook):** -```typescript -// ui/src/hooks/useVoiceInput.ts -export function useVoiceInput() { - // Records with MediaRecorder → blob → POST /api/voice/transcribe - // Falls back to window.SpeechRecognition if whisper not configured -} -``` +### Pattern 5: grammY Long-Poll for Single-User Local Deployment -### Pattern 4: Persistent Memory via File-Backed JSON (No New DB Table) +**What:** Use grammY `bot.start()` (long polling) rather than webhooks. The bot polls Telegram for new messages continuously while the server is running. -**What:** The assistant memory store is a per-workspace JSON file at `data/memory/.json`. Each memory entry has `{ id, content, createdAt, tags }`. The `memoryService` reads this file on startup (lazy-loaded per companyId), keeps it in-process, and writes on mutation. Memory injection works by prepending a formatted memory block to the system prompt at chat-send time in the existing chat service. +**When to use:** Local single-user deployments where a public HTTPS endpoint is not available. No reverse proxy needed, no SSL cert, no domain. -**Why not PostgreSQL:** Adding a new table violates the "no DB schema changes" constraint for upstream rebase safety. File-backed JSON with an in-process cache is fast for a single-user setup (sub-millisecond reads) and requires no migration. 
+ +**Trade-offs:** Long polling is slightly less efficient than webhooks (Telegram must respond to each poll request) but functionally equivalent for <5,000 messages/hour. Fine for personal use. -**When to use:** Personal AI Assistant mode only. Project Builder mode does not use the memory service. - -**Trade-offs:** Not transactional. For a single-user local deployment, this is acceptable. File writes are atomic via write-then-rename pattern. Memory search is linear scan (no vector embeddings in v1.5 — semantic search is a future enhancement). - -**Example:** -```typescript -// server/src/services/memory.ts -export function memoryService() { - const cache = new Map(); - - return { - async inject(companyId: string, systemPrompt: string): Promise<string> { - const store = await load(companyId); - if (store.entries.length === 0) return systemPrompt; - const block = store.entries.map(e => `- ${e.content}`).join("\n"); - return `${systemPrompt}\n\n## What I remember about you:\n${block}`; - } - }; -} -``` - -### Pattern 5: Onboarding State via instance_settings.general JSONB - -**What:** All onboarding configuration (selected mode, voice config, active provider tier) is stored in the existing `instance_settings.general` JSONB column. The `instanceSettingsService` already handles arbitrary JSONB keys. Nexus adds its config under a `nexus` namespace key to avoid upstream key collisions. - -**When to use:** Reading/writing onboarding mode, voice model selection, and provider tier configuration. No new table, no migration. 
- -**Example:** -```typescript -// instance_settings.general.nexus = { -// mode: "personal_ai" | "project_builder" | "both", -// voiceModel: "whisper-tiny.en" | "whisper-base" | null, -// piperVoice: "en_US-amy-medium" | null, -// providerTier: "local" | "puter" | "oauth_gemini" | "api_key", -// } -``` - -### Pattern 6: OAuth Token Storage via Existing Secrets Service - -**What:** OAuth tokens (Google Gemini, OpenAI) and the Puter.js auth token are stored via the existing `secretService` using the `local_encrypted` provider. The onboarding wizard calls `POST /api/companies/:id/secrets` with a well-known name (e.g., `nexus_puter_token`, `nexus_gemini_token`). Adapters read these at spawn time. - -**When to use:** Any time an OAuth flow completes and a token needs persistence. - -**Trade-offs:** Secrets are per-company (workspace), not per-instance. This is fine for single-user setup. The existing secrets UI lets users view/rotate tokens manually. +**Lifecycle:** +- Start: `nexusSettingsService().get()` finds `telegramToken` set → `telegramService(db).start(token)` +- Stop: `server.close()` → `telegramService(db).stop()` +- Runtime toggle: `POST /api/telegram/token` updates nexus-settings and calls start/stop --- ## Data Flow -### Onboarding Wizard Data Flow +### Web Voice Input Flow ``` -User opens Nexus (no workspace yet) - ↓ -NexusOnboardingWizard renders - ↓ -Step 1: ModeSelector → user picks "Personal AI" / "Project Builder" / "Both" - ↓ -Step 2: HardwareSummaryStep - → GET /api/hardware/info (new route) - → hardwareService.detect() → os.totalmem() + system_profiler - → returns { totalGb, gpuName, unifiedMemory, platform } - → wizard shows model tier recommendations - ↓ -Step 3: ProviderTierStep - → Local: already detected via existing Hermes probe - → Puter.js: user clicks "Connect" → puter.auth.signIn() popup - → UI POSTs token to POST /api/puter-proxy/auth - → server stores in secretService("nexus_puter_token") - → OAuth (Gemini/OpenAI): OAuth PKCE flow in a 
popup window - → callback captured by temp local server or redirect - → token stored via secretService - → API Key: direct input → stored via secretService - ↓ -Step 4: VoiceSetupStep (optional, skippable) - → GET /api/voice/status → check if whisper binary present - → User picks model → POST /api/voice/download (async download + SSE progress) - → User picks Piper voice → stored in instance_settings.general.nexus - ↓ -Step 5: OnboardingSummaryStep - → Creates workspace + agents (existing companiesApi + agentsApi flow) - → Writes nexus config to instance_settings.general.nexus - → Navigates to PersonalAssistant page OR Dashboard based on mode +User holds mic button + | + v +VoiceMicButton: MediaRecorder + AnalyserNode + | + v (silence detected after 1.5s or stop pressed) +POST /api/transcribe {audio: webm blob} + | + v +voice.ts route -> voicePipelineService.transcribe(buffer, "webm") + | + v (whisper-cpp or openai-whisper CLI via execFile) +{ text: "transcribed text" } + | + v +ChatInput fills textarea -> user sends (message tagged voiceMode: true) + | + v +POST /conversations/:id/stream -> chatService + puterProxyService + | + v (SSE tokens arrive) +ChatMessage with voice_mode badge + dual output (voice text + full text collapsible) + | + v +TtsButton auto-plays (browser-side piper-tts-web WASM — unchanged from v1.5) ``` -### Personal AI Assistant Chat Data Flow +### Server-Side TTS Flow (POST /synthesize) ``` -User types message in AssistantChatHub - ↓ -ChatInput → useStreamingChat.startStream(conversationId, message) - ↓ -POST /api/companies/:id/chat/conversations/:convId/messages - ↓ (existing chat route, no change) -chat route detects "personal_assistant" agent type - ↓ -memoryService.inject(companyId, systemPrompt) ← NEW injection point - ↓ -Route selects provider based on instance_settings.general.nexus.providerTier: - - "local" → existing Hermes adapter (no change) - - "puter" → puterProxyService.chat() → Puter.js Node client → SSE relay - - "oauth_*" → 
respective provider API with stored OAuth token → SSE relay - ↓ -SSE events stream to UI via existing /api/chat/stream endpoint pattern - ↓ -useStreamingChat receives chunks → ChatMessageList renders them +POST /api/synthesize { text, voiceId? } + | + v +voice.ts route -> voicePipelineService.synthesize(text) + | + v (piper CLI via spawn: text -> stdin, WAV bytes <- stdout) +Response: Content-Type audio/wav, Buffer body + | + v +Client: new Audio(URL.createObjectURL(blob)).play() ``` -### Voice Input Data Flow +Note: Server-side `/synthesize` is new in v1.6. Its primary consumer is the Telegram bridge (which cannot use browser WASM). Web chat continues using browser-side `usePiperTts` WASM (v1.5 unchanged). The route is available for headless/server scenarios going forward. + +### Telegram Text Message Flow ``` -User presses mic button in ChatInput (MODIFIED) - ↓ -useVoiceInput starts MediaRecorder → records WebM/Opus blob - ↓ -User releases mic → blob POSTed to POST /api/voice/transcribe - ↓ -voiceService.transcribe(audioBuffer) - → whisper-node.transcribe(path) → returns text - ↓ -Text injected into ChatInput.value - ↓ -User reviews → sends normally +Telegram user sends text + | + v +grammY bot.on("message:text") handler + | + v +telegramService: resolveOrCreateConversation(db) + | + v +chatService(db).addMessage(conversationId, { role: "user", content: text }) + | + v +telegramService: collect full response via puterProxyService(db).chatStream() + | + v (if voiceMode !== "full_voice") +ctx.reply("[AgentName]: full_response_text") + + | (if voiceMode === "full_voice") + v +voicePipelineService.formatForVoice(response) -> { voice, full } +ctx.reply("[AgentName]: " + full) -- text message with full details + | + v +voicePipelineService.synthesize(voice) -> WAV Buffer +ctx.replyWithAudio(InputFile(wavBuffer, "reply.ogg")) ``` -### npx buildthis Data Flow +### Telegram Voice Message Flow ``` -Developer runs: npx buildthis - ↓ -buildthis/src/index.ts checks for 
running Nexus: - GET http://localhost:4000/api/health → 200? - YES → open browser to http://localhost:4000 - NO → run nexus onboard wizard (delegates to paperclipai onboard) - OR detect Docker → suggest docker-compose up +Telegram user sends voice note (OGG Opus format) + | + v +grammY bot.on("message:voice") -> ctx.getFile() -> download Buffer + | + v +voicePipelineService.transcribe(buffer, "ogg") -> whisper CLI -> text + | + v +(same path as Telegram text message above) ``` ---- +### nexus-settings Schema Evolution -## Integration Points: New vs Modified +``` +v1.5: { mode, voiceEnabled } +v1.6: { mode, voiceEnabled, voiceMode, telegramToken } -### Server Routes — app.ts (MODIFIED) - -One file to add 4 route mounts. Minimal conflict surface with upstream. - -```typescript -// In server/src/app.ts — add after ollamaRoutes(): -app.use(hardwareRoutes()); -app.use(voiceRoutes()); -app.use(memoryRoutes(db)); -app.use(puterProxyRoutes(db)); + voiceMode: "text" | "voice_input" | "full_voice" (default: "text") + telegramToken: string | undefined (set by user via UI or POST /telegram/token) ``` -### Chat Route — MODIFIED for memory injection - -The existing chat service (`server/src/services/chat.ts`) needs one injection point: when building the system prompt for a conversation, call `memoryService.inject()`. This is scoped to conversations where the agent has `adapterConfig.assistantMode === true`. - -**Risk:** This touches an upstream file. The injection is a 3-line addition inside the message-send handler. Low conflict probability — upstream rarely modifies this section. - -### NexusOnboardingWizard.tsx — REPLACED - -The current single-step wizard becomes a multi-step wizard. Since this file is already a Nexus replacement (not an upstream file), there is zero conflict risk — it will never exist in upstream. - -### App.tsx routing — MODIFIED (one new route) - -Add the `PersonalAssistant` page as a new lazy-loaded route. 
Minimal upstream conflict (routing section rarely changes). - -### ChatInput.tsx — MODIFIED (voice button) - -Add a microphone button that triggers `useVoiceInput`. This is an upstream file — the modification is additive (new button, no existing logic changed). Conflict risk: LOW, as upstream rarely modifies ChatInput. - ---- - -## Anti-Patterns - -### Anti-Pattern 1: Browser-Direct Puter.js - -**What people do:** Import `@heyputer/puter.js` in the React frontend and call `puter.ai.chat()` directly from the browser. - -**Why it's wrong:** Exposes the Puter auth token in browser storage/network. Bypasses the existing SSE pipeline, requiring a second streaming implementation. Breaks the memory injection pattern (no server-side hook). Cannot use the existing `useStreamingChat` hook. - -**Do this instead:** Use the server proxy pattern (Pattern 2). The UI sends messages to `/api/puter-proxy/chat` exactly like any other chat endpoint. - -### Anti-Pattern 2: New PostgreSQL Tables for Memory - -**What people do:** Create a `assistant_memories` migration with a proper relational schema. - -**Why it's wrong:** Violates the hard constraint: no DB migrations, no schema changes, to keep upstream rebase clean. A migration file created in Nexus will conflict every time upstream adds a migration. - -**Do this instead:** File-backed JSON in the server's data directory (Pattern 4). The single-user M4 Mini deployment will never hit performance limits with this approach. - -### Anti-Pattern 3: Multi-Step Wizard as Modified OnboardingWizard.tsx - -**What people do:** Modify the upstream `OnboardingWizard.tsx` directly to add v1.5 steps. - -**Why it's wrong:** The upstream wizard is actively maintained (120+ upstream commits since fork). Touching it creates guaranteed rebase conflicts. - -**Do this instead:** Continue the existing pattern — `NexusOnboardingWizard.tsx` is already the Nexus replacement via Vite alias. All v1.5 changes go there. Upstream file untouched. 
- -### Anti-Pattern 4: OAuth in the Browser via Redirect - -**What people do:** Redirect the main app window to the OAuth provider and handle the callback via `window.location`. - -**Why it's wrong:** Loses React state mid-flow. Hard to handle callback URL in a local server that may not have a publicly routable HTTPS endpoint. - -**Do this instead:** Use a popup window for OAuth (`window.open`). The popup handles the full OAuth redirect. On callback, the popup calls `window.opener.postMessage` with the token, closes itself, and the main window receives it. For Puter.js specifically, `puter.auth.signIn()` handles the popup internally. +`voiceMode` is a workspace-level setting (not per-agent). The three states map to: +- `"text"`: mic button transcribes to text input, TTS manual-only, Telegram text-only +- `"voice_input"`: mic transcribes and auto-sends, TTS manual-only, Telegram voice-in + text-out +- `"full_voice"`: mic auto-sends, TTS auto-plays on every response, Telegram voice-in + voice-out --- ## Scaling Considerations -This is a single-user local deployment on an M4 Mini. Scaling is not a concern for v1.5. The architecture is designed for correctness and upstream merge-ability, not horizontal scale. +This system targets a single user on Mac Mini M4 throughout its lifetime. Scaling is not a concern. The architecture is optimized for simplicity and upstream merge compatibility. 
-| Concern | Single User (M4 Mini) | -|---------|----------------------| -| Hardware detection | os.totalmem() + sync shell probe, cached 5min — negligible | -| Puter.js relay | One connection at a time, no pooling needed | -| Whisper transcription | ~2s for 10s clip on M4, sequential queue sufficient | -| Memory store | File JSON, <10ms read, no contention | -| Voice TTS | WASM in browser, zero server load | +| Concern | At 1 user (target) | Notes | +|---------|-------------------|-------| +| STT latency | whisper-cpp base.en on M4: ~1-3s | Acceptable; shows transcribing spinner | +| TTS latency | piper CLI on M4: ~0.3-1s for short text | <3s target met | +| Telegram poll | grammY `bot.start()`, 1 process | Adequate for <5,000 msgs/hour | +| Memory overhead | ~10-20MB for polling loop | Acceptable on 16GB+ M4 | +| Piper model | First server-side synthesize: cold start | Piper loads model into memory; subsequent calls fast | --- -## Build Order (Dependency Graph) +## Anti-Patterns -The build order matters because later phases consume services built in earlier ones. +### Anti-Pattern 1: Telegram-Specific Voice Logic -``` -Phase 1: Hardware Detection - → hardwareService (server) - → GET /api/hardware/info (route) - → useHardwareInfo hook (UI) - → HardwareSummaryStep component (UI) - No dependencies on other new phases. +**What people do:** Implement OGG-to-text and text-to-OGG directly inside the Telegram bot handler. -Phase 2: Provider Tiers (depends on Phase 1 for display) - → puterProxyService (server) — Puter.js Node client - → secretService integration for token storage (uses EXISTING service) - → POST /api/puter-proxy/auth (route) - → ProviderTierStep component (UI) - → OAuth popup flow (UI) +**Why it's wrong:** Creates two separate STT/TTS code paths that diverge over time. Voice bugs must be fixed in two places. Untestable in isolation. 
-Phase 3: Multi-Step Onboarding Wizard (depends on Phases 1+2) - → ModeSelector, OnboardingSummaryStep components (UI) - → Refactor NexusOnboardingWizard.tsx into multi-step - → instance_settings.general.nexus config write +**Do this instead:** All voice processing goes through `voicePipelineService`. The Telegram handler calls `transcribe(buf, "ogg")` — the service handles format differences. The web route calls `transcribe(buf, "webm")` — same service, different format argument. -Phase 4: Persistent Memory + Assistant Mode (depends on Phase 3) - → memoryService (server) - → Memory injection in chat route (MODIFIED — highest risk step) - → GET/POST/DELETE /api/companies/:id/memory (routes) - → PersonalAssistantPage (UI) - → useAssistantMemory hook (UI) +### Anti-Pattern 2: Circular HTTP Call for Telegram AI Response -Phase 5: Voice (depends on Phase 3, independent of Phase 4) - → voiceService (server) — whisper-node + piper setup - → POST /api/voice/transcribe, /speak, /status (routes) - → VoiceSetupStep in onboarding (UI) - → useVoiceInput, useVoiceSpeech hooks (UI) - → ChatInput microphone button (MODIFIED — upstream file, low risk) +**What people do:** Telegram bot handler calls `fetch("http://localhost:PORT/api/conversations/:id/stream")` to get AI responses from within the same server process. -Phase 6: npx buildthis (independent of all above) - → packages/buildthis/ new package - → package.json bin field setup - → npm publish configuration -``` +**Why it's wrong:** Requires auth token injection. Fragile (port discovery). Extra TCP round-trip. Fails in test environments where the HTTP server may not be running. -**Recommended sequence:** 1 → 2 → 3 → 4 → 5 → 6. Phase 4 (memory injection into chat route) is the highest-risk upstream-file modification and should come after onboarding is validated. +**Do this instead:** `telegramService` imports `chatService(db)` and `puterProxyService(db)` directly. 
Collect tokens from the async generator into a string, then send to Telegram as a single message. + +### Anti-Pattern 3: Blocking grammY on Slow CLI Processes + +**What people do:** `await synthesize()` inside a bot handler with no timeout, assuming piper is always available and fast. + +**Why it's wrong:** If the `piper` binary is not installed or hangs, the grammY update queue stalls. The same update gets retried indefinitely. + +**Do this instead:** Wrap CLI calls in a `Promise.race([piperCall, timeout(8_000)])`. If piper times out or is not installed, fall back to text-only reply and log the failure. Bot degrades gracefully to text mode. + +### Anti-Pattern 4: Keeping /transcribe Inside chat-files.ts + +**What people do:** Leave the STT handler in `chat-files.ts` and call `voicePipelineService` from there, adding Nexus-specific logic to an upstream-sourced file. + +**Why it's wrong:** `chat-files.ts` is a mostly-upstream Paperclip file. Each rebase introduces merge conflicts. More Nexus-specific code in the file = more conflict surface. + +**Do this instead:** Move `/transcribe` and `/synthesize` to a new `voice.ts` route file (Nexus-only, never in upstream). Keep `chat-files.ts` as close to upstream as possible. + +### Anti-Pattern 5: Storing Telegram Token in Database + +**What people do:** Create a new DB table or add a column to `instance_settings` to store the Telegram bot token. + +**Why it's wrong:** Any DB schema change blocks upstream rebase (migration files conflict). The `nexus-settings.json` file-backed service is the established Nexus pattern for project-specific config that has no upstream equivalent. + +**Do this instead:** Store `telegramToken` in `nexus-settings.json` via the existing `nexusSettingsService`. Same pattern as `voiceEnabled`, `mode`. 
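The timeout guard described under Anti-Pattern 3 can be sketched as follows. This is a minimal illustration, not the project's implementation: `withTimeout`, `synthesizeOrText`, and the `VoiceReply` type are hypothetical names, and the `synthesize` argument stands in for whatever function wraps the piper CLI. The 8-second default mirrors the `timeout(8_000)` mentioned above.

```typescript
// Hypothetical helper: settles with `fallback` if `work` rejects or does not
// settle within `ms` milliseconds. The timer is always cleared afterwards.
async function withTimeout<T, F>(work: Promise<T>, ms: number, fallback: F): Promise<T | F> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<F>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } catch {
    // The wrapped call failed (for example, the binary is not installed).
    return fallback;
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Hypothetical reply shape: audio when synthesis succeeded in time, text otherwise.
type VoiceReply = { kind: "audio"; wav: Uint8Array } | { kind: "text" };

// Assumed usage inside a bot handler: try TTS, degrade to a text-only reply.
async function synthesizeOrText(
  synthesize: (text: string) => Promise<Uint8Array>,
  text: string,
  ms = 8_000,
): Promise<VoiceReply> {
  const wav = await withTimeout(synthesize(text), ms, null);
  return wav === null ? { kind: "text" } : { kind: "audio", wav };
}
```

Note that `Promise.race` only stops waiting; it does not terminate the underlying child process. A real handler would likely also kill the spawned process on timeout and log the failure before replying with text.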
--- -## Integration Points: External Services +## Integration Points -| Service | Integration Pattern | Auth Storage | Notes | -|---------|---------------------|--------------|-------| -| Puter.js | Server-side Node.js client proxy, SSE relay | `company_secrets` table | Token obtained via browser popup on first connect | -| Google Gemini OAuth | PKCE popup flow, access token + refresh token | `company_secrets` table | Policy risk: using Gemini CLI OAuth with third-party apps may trigger abuse detection — use only if user has an active Gemini subscription | -| OpenAI OAuth | PKCE flow via auth.openai.com | `company_secrets` table | Only for free tier / ChatGPT Plus users | -| Whisper (whisper-node) | Native binary, spawned by voiceService | N/A — local binary | Download on first use, cached in data/whisper-models/ | -| Piper TTS | @mintplex-labs/piper-tts-web WASM, runs in browser | N/A — client-side | Model files downloaded to browser cache | -| Ollama | Existing integration (v1.4) — no changes | N/A | ollama.ts service and /ollama routes unchanged | +### External Services + +| Service | Integration Pattern | Notes | +|---------|---------------------|-------| +| Telegram Bot API | grammY `bot.start()` long-polling (Node.js) | No public URL required; polling starts on server boot if token present in nexus-settings | +| whisper-cpp / openai-whisper | `execFile` cascade (same as existing `/transcribe`) | Format argument added: writes `.webm` or `.ogg` temp file based on input | +| piper TTS binary | `child_process.spawn` stdin -> stdout | Text piped to stdin; WAV or raw PCM bytes collected from stdout | + +### Internal Boundaries + +| Boundary | Communication | Notes | +|----------|---------------|-------| +| voice route <-> voicePipelineService | Direct function call | Route is thin HTTP wrapper; all logic in service | +| telegram service <-> voicePipelineService | Direct function call | Same service used by both transports | +| telegram service <-> chatService | 
Direct function call | Bot calls `chatService(db)` directly — no HTTP round-trip | +| telegram service <-> nexusSettingsService | Direct function call | Reads `voiceMode` and `telegramToken` at start and on each message | +| web UI <-> voice route | REST: `POST /api/transcribe`, `POST /api/synthesize` | Web client uses browser-side piper WASM for TTS; `/synthesize` primarily for Telegram | +| UI VoiceModeToggle <-> nexus-settings | REST: `PATCH /api/nexus-settings` | Reads/writes `voiceMode` setting | + +--- + +## Build Order + +Based on component dependencies, the recommended build order within this milestone: + +| Step | Component(s) | Reason | +|------|-------------|--------| +| 1 | `nexus-settings` schema extensions (`voiceMode`, `telegramToken`) | Everything downstream reads settings | +| 2 | `voicePipelineService` | Backs all voice. No new deps. Independently testable. | +| 3 | `voice.ts` route (`POST /transcribe`, `POST /synthesize`) | Thin wrapper. Register in `app.ts`. Move handler from chat-files. | +| 4 | `VoiceMicButton` + `WaveformDisplay` + `useSilenceDetection` | Pure UI. Depends only on `/transcribe`. | +| 5 | `VoiceModeToggle` + `useVoiceMode` | Depends on `voiceMode` in nexus-settings schema (Step 1). | +| 6 | `ChatMessage` dual output | Depends on `voiceMode` in shared `ChatMessage` type. | +| 7 | `createMessageSchema` + `ChatMessage` type (`voiceMode` flag) | Shared package change. Required by Steps 5-6. Could move earlier. | +| 8 | `telegramService` | Depends on voicePipelineService (2), chatService (existing), nexusSettings (1). | +| 9 | `telegram.ts` route + app.ts registration | Management endpoints. Needs telegramService. | +| 10 | Onboarding STT/TTS hardware detection step | Final: wires all voice detection into onboarding flow. | + +Steps 4-6 can run in parallel with Steps 8-9 if split across phases, provided Step 7 (required by Steps 5-6) lands first.
--- ## Sources -- Codebase inspection: `/opt/nexus/server/src/`, `/opt/nexus/ui/src/`, `/opt/nexus/packages/` -- Puter.js Node.js support: https://docs.puter.com/supported-platforms/ -- Puter.js chat streaming API: https://docs.puter.com/AI/chat/ -- Puter.js auth flow: https://developer.puter.com/blog/browser-based-auth-puter-js-node/ -- whisper-node npm package: https://www.npmjs.com/package/whisper-node -- Piper TTS WASM: https://www.npmjs.com/package/@mintplex-labs/piper-tts-web -- @xenova/transformers Node.js audio guide: https://huggingface.co/docs/transformers.js/main/en/guides/node-audio-processing -- Google Gemini OAuth: https://ai.google.dev/gemini-api/docs/oauth -- Google Gemini OAuth policy risk: https://github.com/google-gemini/gemini-cli/issues/21866 -- Vectra local vector DB (future memory enhancement): https://github.com/Stevenic/vectra -- Apple Silicon unified memory: https://eclecticlight.co/2022/03/01/making-sense-of-m1-memory-use/ +- Direct codebase inspection: `server/src/routes/chat-files.ts` (lines 297-386), `server/src/routes/chat.ts`, `server/src/services/nexus-settings.ts`, `server/src/app.ts`, `ui/src/components/VoiceRecordButton.tsx`, `ui/src/components/TtsButton.tsx`, `ui/src/hooks/usePiperTts.ts`, `packages/shared/src/validators/chat.ts`, `packages/shared/src/types/chat.ts` +- `.planning/STATE.md` — v1.6 architectural decisions (transport-agnostic, disposable bridge, dual output, per-message flag) +- `.planning/milestones/v1.5-phases/34-voice/34-RESEARCH.md` — existing voice implementation details, WASM TTS pattern +- [grammY documentation](https://grammy.dev/) — TypeScript-native, Bot API 9.6 (April 2026), long-polling vs webhooks +- [grammY deployment types guide](https://grammy.dev/guide/deployment-types) — long polling recommended for single-user local; Express integration pattern +- [rhasspy/piper (archived)](https://github.com/rhasspy/piper) — CLI: `echo "text" | piper --model voice.onnx -f -`; development moved to 
OHF-Voice/piper1-gpl Oct 2025 +- grammY supports Telegram Bot API 9.6 (released April 3, 2026) — latest version confirmed --- -*Architecture research for: Nexus v1.5 Smart Onboarding + Personal AI Assistant* -*Researched: 2026-04-02* +*Architecture research for: Voice Pipeline + Minimal Telegram Bridge (v1.6)* +*Researched: 2026-04-03* diff --git a/.planning/research/FEATURES.md b/.planning/research/FEATURES.md index d3b9f366..beb4599b 100644 --- a/.planning/research/FEATURES.md +++ b/.planning/research/FEATURES.md @@ -1,22 +1,30 @@ # Feature Research -**Domain:** Smart Onboarding + Personal AI Assistant (Nexus v1.5) -**Researched:** 2026-04-02 -**Confidence:** MEDIUM overall — Puter.js confirmed current, hardware detection patterns confirmed, personal AI assistant patterns from active ecosystem; UX recommendations inferred from patterns +**Domain:** Voice Pipeline (Whisper STT + Piper TTS) + Telegram Bridge (Nexus v1.6) +**Researched:** 2026-04-03 +**Confidence:** MEDIUM-HIGH — STT/TTS pipeline patterns are well-documented; Telegram bot API is stable; dual-output formatting and voice mode UX patterns inferred from ChatGPT/Meta AI voice implementations and community patterns --- ## Milestone Scope -This document covers only the NEW features in v1.5. Existing features (NexusOnboardingWizard, Hermes adapter, Ollama integration, chat interface, PWA, voice input via Whisper) are already built and are dependencies, not deliverables. +This document covers only the NEW features in v1.6. 
The following are already built and are dependencies, not deliverables: + +- VoiceRecordButton with MediaRecorder API in ChatInput (v1.3) +- TtsButton with @mintplex-labs/piper-tts-web WASM synthesis (v1.3/v1.5) +- POST /transcribe endpoint with whisper-cpp/openai-whisper cascade (v1.3) +- VoiceStep in onboarding wizard (v1.5) +- voiceEnabled in nexus-settings (v1.5) +- Full chat system with streaming SSE (v1.3) **New features being researched:** -- Hardware detection with pre-built model database -- Tiered provider setup: local (Ollama) → zero-config cloud (Puter.js) → OAuth cloud (Gemini, OpenAI) → API key / subscription (Hermes, Claude Code, OpenClaw) -- Personal AI Assistant mode with persistent memory, MCP connections, voice (Whisper + Piper) -- Project handoff: assistant conversation → PM agent with context transfer -- `npx buildthis` CLI entry point -- Every step skippable +- Transport-agnostic voice pipeline (server-side, not just browser WASM) +- Voice mode flag on messages (affects response formatting) +- Dual output pattern: voice-optimized prose + full markdown text +- Web chat voice UI improvements: silence detection, waveform, auto-submit +- Web chat audio playback: inline player, auto-play toggle +- Voice mode toggle setting (text only / voice input / full voice) +- Minimal Telegram bridge: single bot, text + voice relay, agent prefixing --- @@ -24,130 +32,131 @@ This document covers only the NEW features in v1.5. Existing features (NexusOnbo ### Table Stakes (Users Expect These) -Features users assume exist in a modern AI onboarding flow. Missing these makes onboarding feel broken or untrustworthy. +Features users assume exist when voice or Telegram is mentioned. Missing these makes the feature feel broken or incomplete. 
| Feature | Why Expected | Complexity | Notes | |---------|--------------|------------|-------| -| Hardware auto-detection on first run | Any local AI tool probes GPU/RAM; users expect "it just knows" | MEDIUM | Node.js can read `/proc/meminfo`, spawn `nvidia-smi`, detect Apple Silicon via `os.arch()`; Ollama's `/api/tags` endpoint also reveals loaded models | -| RAM-aware model recommendations | Ollama and LM Studio both do this; users have been trained to expect it | LOW | Pre-built lookup table: <8GB RAM → 3B-7B, 8-16GB → 7B-13B, 16GB+ → 30B+; VRAM takes priority over system RAM | -| Step-skippable onboarding | Any wizard that forces completion feels hostile; Clerk, Vercel, and Postman all allow skip | LOW | Each step needs a "skip" or "set up later" affordance; final summary shows what was skipped | -| Progress indicator | Multi-step wizards without progress indicators cause anxiety ("how many more steps?") | LOW | Step counter or progress bar; 5-7 max steps total | -| Summary screen before entering app | Users need to understand what was set up before being dropped in the dashboard | LOW | Show: mode selected, provider configured, models available; "Start chatting" CTA | -| "Test connection" before saving | Every API key entry form should validate before proceeding | LOW | Quick `/health` or echo call to configured provider; show latency | -| Persisted onboarding state | Refreshing mid-wizard should not restart from step 1 | LOW | LocalStorage or DB; existing NexusOnboardingWizard already handles this pattern | -| Voice input/output toggle | Users who selected voice features expect them to work immediately | MEDIUM | Whisper already exists (v1.3); Piper TTS is the new addition; toggle in assistant settings | -| Persistent conversation memory | Any "personal AI assistant" product ships some form of memory (ChatGPT, Claude Projects, Gemini) | HIGH | Users compare against ChatGPT memory; table stakes for the mode to feel meaningful | -| MCP-style external 
connections | Power users expect the assistant to connect to their tools (files, git, search) | MEDIUM | MCP is now a universal standard (Anthropic, OpenAI, Google all adopted it); STDIO and HTTP transport both needed | +| Silence-based auto-submit | Every voice input UI (Siri, Google, Whisper demos) stops recording on silence; holding a button feels archaic | MEDIUM | WebRTC VAD or AudioWorklet amplitude monitoring; 1.5s silence threshold typical; must show countdown so user knows what's happening | +| Waveform/amplitude visualization while recording | Users expect visual feedback that the mic is active; a static "recording..." text feels broken | LOW | Canvas or SVG with 30-50 data points; AnalyserNode from Web Audio API; real-time amplitude bars, not pre-rendered waveform | +| Voice response auto-play toggle | If the AI responded with audio, playing it automatically is expected unless the user disabled it; manual play-only feels incomplete | LOW | Boolean setting in nexus-settings (voiceAutoPlay); inline HTML5 `