docs: complete project research
parent 5ea3f2d6b5
commit f6da67ecf4
5 changed files with 1466 additions and 1079 deletions
|
|
@@ -1,544 +1,507 @@
|
|||
# Architecture Research
|
||||
|
||||
**Domain:** Smart Onboarding + Personal AI Assistant (v1.5) — integration with existing Nexus/Paperclip monorepo
|
||||
**Researched:** 2026-04-02
|
||||
**Domain:** Voice Pipeline + Minimal Telegram Bridge (v1.6) — integration with existing Nexus/Paperclip monorepo
|
||||
**Researched:** 2026-04-03
|
||||
**Confidence:** HIGH — based on direct codebase inspection + verified current documentation
|
||||
|
||||
---
|
||||
|
||||
## System Overview
|
||||
|
||||
The v1.5 features layer on top of the existing monorepo without touching DB schema, API routes, or TypeScript identifiers. Every new service, component, and data flow hooks into the existing extension points: the adapter registry, the secrets service, the instance settings JSONB columns, the chat SSE pipeline, and the onboarding wizard overlay.
|
||||
v1.6 adds two parallel capability tracks onto the existing monorepo: a transport-agnostic voice pipeline (Whisper STT + Piper TTS) and a disposable Telegram bridge that reuses those pipeline primitives for phone access. The architecture constraint is that no voice or chat logic is Telegram-specific — Telegram is an interchangeable transport layer that calls the same server services as the web UI.
|
||||
|
||||
```
|
||||
┌──────────────────────────────────────────────────────────────────────────┐
|
||||
│ UI Layer (React/Vite) │
|
||||
│ │
|
||||
│ ┌─────────────────────────────────────────┐ ┌────────────────────────┐ │
|
||||
│ │ NexusOnboardingWizard (MODIFIED) │ │ PersonalAssistantPage │ │
|
||||
│ │ ┌──────────────┐ ┌──────────────────┐ │ │ (NEW — lazy loaded) │ │
|
||||
│ │ │ ModeSelector │ │ HardwareSummary │ │ │ ┌──────────────────┐ │ │
|
||||
│ │ │ (NEW) │ │ (NEW) │ │ │ │ AssistantChatHub │ │ │
|
||||
│ │ └──────────────┘ └──────────────────┘ │ │ │ (MODIFIED │ │ │
|
||||
│ │ ┌──────────────┐ ┌──────────────────┐ │ │ │ ChatPanel) │ │ │
|
||||
│ │ │ProviderSetup │ │ VoiceSetupStep │ │ │ └──────────────────┘ │ │
|
||||
│ │ │ (NEW) │ │ (NEW) │ │ └────────────────────────┘ │
|
||||
│ │ └──────────────┘ └──────────────────┘ │ │
|
||||
│ └─────────────────────────────────────────┘ │
|
||||
│ │
|
||||
│ ┌───────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ Existing Extension Points │ │
|
||||
│ │ ChatPanel • ChatInput • useStreamingChat • ChatAgentSelector │ │
|
||||
│ └───────────────────────────────────────────────────────────────────┘ │
|
||||
└──────────────────────────────────────────────────────────────────────────┘
|
||||
↕ REST + SSE
|
||||
┌──────────────────────────────────────────────────────────────────────────┐
|
||||
│ Server Layer (Express) │
|
||||
│ │
|
||||
│ NEW routes mounted in app.ts: │
|
||||
│ ┌────────────────────┐ ┌───────────────────┐ ┌──────────────────────┐ │
|
||||
│ │ /api/hardware │ │ /api/puter-proxy │ │ /api/voice │ │
|
||||
│ │ (hardware detect) │ │ (Puter.js relay) │ │ (Whisper + Piper) │ │
|
||||
│ └────────────────────┘ └───────────────────┘ └──────────────────────┘ │
|
||||
│ ┌────────────────────┐ ┌───────────────────┐ │
|
||||
│ │ /api/memory │ │ Existing routes: │ │
|
||||
│ │ (assistant memory) │ │ /ollama • /chat │ │
|
||||
│ └────────────────────┘ │ /secrets • /llms │ │
|
||||
│ └───────────────────┘ │
|
||||
│ │
|
||||
│ NEW services (named-export pattern, no classes): │
|
||||
│ hardwareService • puterProxyService • voiceService • memoryService │
|
||||
└──────────────────────────────────────────────────────────────────────────┘
|
||||
↕ Drizzle ORM
|
||||
┌──────────────────────────────────────────────────────────────────────────┐
|
||||
│ Data Layer (PostgreSQL) │
|
||||
│ │
|
||||
│ NO new tables — all v1.5 state lives in existing extension columns: │
|
||||
│ ┌────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ instance_settings.general JSONB (onboarding mode, voice config) │ │
|
||||
│ │ company_secrets table (OAuth tokens, Puter token) │ │
|
||||
│ │ chat_conversations table (no change — re-used as-is) │ │
|
||||
│ └────────────────────────────────────────────────────────────────────┘ │
|
||||
│ │
|
||||
│ NEW file-based storage (server data dir, no migration needed): │
|
||||
│ ┌─────────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ data/memory/<companyId>.json (assistant memory store) │ │
|
||||
│ │ data/whisper-models/ (downloaded .bin files) │ │
|
||||
│ │ data/piper-voices/ (downloaded .onnx voice files) │ │
|
||||
│ └─────────────────────────────────────────────────────────────────────┘ │
|
||||
└──────────────────────────────────────────────────────────────────────────┘
|
||||
↕ npx
|
||||
┌──────────────────────────────────────────────────────────────────────────┐
|
||||
│ CLI Layer (Commander.js) │
|
||||
│ │
|
||||
│ NEW standalone package: packages/buildthis/ │
|
||||
│ ┌──────────────────────────────────────────────────────────────────────┐│
|
||||
│ │ npx buildthis → detects if Nexus running → opens browser ││
|
||||
│ │ OR runs nexus onboard wizard → starts server ││
|
||||
│ └──────────────────────────────────────────────────────────────────────┘│
|
||||
└──────────────────────────────────────────────────────────────────────────┘
|
||||
+-----------------------------------------------------------------------------------+
|
||||
| UI Layer (React/Vite) |
|
||||
| |
|
||||
| +-------------------------------------------------------------------------+ |
|
||||
| | ChatPanel / PersonalAssistant (MODIFIED) | |
|
||||
| | +---------------------+ +--------------------+ +------------------+ | |
|
||||
| | | VoiceMicButton (NEW)| | WaveformDisplay | | TtsButton (v1.5) | | |
|
||||
| | | silence detection | | (NEW) animated bars| | + auto-play prop | | |
|
||||
| | | auto-send on silence| +--------------------+ +------------------+ | |
|
||||
| | +---------------------+ | |
|
||||
| | +-------------------------------------------------------------------+ | |
|
||||
| | | ChatMessage (MODIFIED) — voice_mode badge, dual output toggle | | |
|
||||
| | +-------------------------------------------------------------------+ | |
|
||||
| | +-------------------------------------------------------------------+ | |
|
||||
| | | VoiceModeToggle (NEW) — text only / voice input / full voice | | |
|
||||
| | +-------------------------------------------------------------------+ | |
|
||||
| +-------------------------------------------------------------------------+ |
|
||||
+-----------------------------------------------------------------------------------+
|
||||
| HTTP + SSE
|
||||
+-----------------------------------------------------------------------------------+
|
||||
| Server Layer (Express) |
|
||||
| |
|
||||
| +------------------------------------+ +------------------------------------+ |
|
||||
| | voice.ts (NEW route) | | telegram.ts (NEW route/service) | |
|
||||
| | POST /transcribe (MOVED) | | grammY long-poll process | |
|
||||
| | POST /synthesize (NEW) | | text + voice relay | |
|
||||
| +------------------------------------+ +------------------------------------+ |
|
||||
| | | |
|
||||
| +-----------------v--------------------------------------------v--------------+ |
|
||||
| | voicePipelineService (NEW — core) | |
|
||||
| | transcribe(audioBuffer, format) -> string | |
|
||||
| | synthesize(text, voiceId?) -> Buffer (WAV) | |
|
||||
| | formatForVoice(text) -> { voice: string, full: string } | |
|
||||
| +------------------------------------------------------------------------------+ |
|
||||
| | |
|
||||
| +-----------------v--------------------------------------------------------------+|
|
||||
| | chatService / nexusSettingsService (EXISTING) ||
|
||||
| | conversations . messages . stream SSE . memory . voiceEnabled ||
|
||||
| +--------------------------------------------------------------------------------+|
|
||||
| | |
|
||||
| +-----------------v--------------------------------------------------------------+|
|
||||
| | External Processes (spawned via child_process.spawn / execFile) ||
|
||||
| | whisper-cpp / whisper (STT) piper (TTS) ||
|
||||
| +--------------------------------------------------------------------------------+|
|
||||
+-----------------------------------------------------------------------------------+
|
||||
^
|
||||
| Telegram Bot API (HTTPS long-poll)
|
||||
+--------+------------------------------------------------------------------------+
|
||||
| Telegram (external service) |
|
||||
| User sends text -> bot relays to chatService -> SSE reply -> bot sends back |
|
||||
| User sends voice -> bot downloads OGG -> voicePipelineService.transcribe() |
|
||||
| -> chatService -> reply -> voicePipelineService.synthesize() |
|
||||
| -> bot sends OGG audio reply |
|
||||
+----------------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Component Responsibilities
|
||||
## Integration Points: New vs. Existing
|
||||
|
||||
| Component | Responsibility | New or Modified | Where |
|-----------|----------------|-----------------|-------|
| `NexusOnboardingWizard` | Multi-step onboarding: mode, hardware, provider, voice, summary | MODIFIED (replace single-step) | `ui/src/components/NexusOnboardingWizard.tsx` |
| `ModeSelector` | Card picker: Personal AI / Project Builder / Both | NEW | `ui/src/components/onboarding/ModeSelector.tsx` |
| `HardwareSummaryStep` | Displays detected GPU/RAM/Unified Memory result | NEW | `ui/src/components/onboarding/HardwareSummaryStep.tsx` |
| `ProviderTierStep` | Puter.js auth button, OAuth tier, API key entry | NEW | `ui/src/components/onboarding/ProviderTierStep.tsx` |
| `VoiceSetupStep` | Whisper model picker + Piper voice picker | NEW | `ui/src/components/onboarding/VoiceSetupStep.tsx` |
| `OnboardingSummaryStep` | Final summary before launch | NEW | `ui/src/components/onboarding/OnboardingSummaryStep.tsx` |
| `PersonalAssistantPage` | Full-screen chat experience for assistant mode | NEW | `ui/src/pages/PersonalAssistant.tsx` |
| `AssistantMemoryBar` | Shows memory slots / recall indicator in chat | NEW | `ui/src/components/AssistantMemoryBar.tsx` |
| `hardwareService` | Reads `os.totalmem()`, runs `system_profiler` on macOS for GPU info | NEW | `server/src/services/hardware.ts` |
| `puterProxyService` | Wraps Puter.js Node.js client; relays AI calls through SSE | NEW | `server/src/services/puter-proxy.ts` |
| `voiceService` | Manages Whisper (via `whisper-node`) + Piper (via `@mintplex-labs/piper-tts-web` server-side) | NEW | `server/src/services/voice.ts` |
| `memoryService` | CRUD on file-based JSON memory store; injects context into system prompt | NEW | `server/src/services/memory.ts` |
| `hardwareRoutes` | `GET /api/hardware/info` | NEW | `server/src/routes/hardware.ts` |
| `puterProxyRoutes` | `POST /api/puter-proxy/chat` (SSE), `POST /api/puter-proxy/auth` | NEW | `server/src/routes/puter-proxy.ts` |
| `voiceRoutes` | `POST /api/voice/transcribe`, `POST /api/voice/speak`, `GET /api/voice/status` | NEW | `server/src/routes/voice.ts` |
| `memoryRoutes` | `GET/POST/DELETE /api/companies/:id/memory` | NEW | `server/src/routes/memory.ts` |
| `buildthis` package | `npx buildthis` entry point — detect/launch Nexus | NEW | `packages/buildthis/` |
|
||||
### What Stays Unchanged
|
||||
|
||||
| Component | Location | Status |
|-----------|----------|--------|
| `chatService` | `server/src/services/chat.ts` | No changes — voice pipeline uses it as-is |
| `nexusSettingsService` | `server/src/services/nexus-settings.ts` | Extend schema only (add `voiceMode`, `telegramToken`) |
| `chatFileRoutes` | `server/src/routes/chat-files.ts` | `/transcribe` moves out; file upload stays |
| `usePiperTts` | `ui/src/hooks/usePiperTts.ts` | No changes — TtsButton continues using browser WASM |
| `TtsButton` | `ui/src/components/TtsButton.tsx` | Add auto-play prop only |
| SSE stream endpoint | `server/src/routes/chat.ts` | No changes — Telegram bridge calls services directly |
| DB schema | `packages/db` | No changes — voice is file/process, not a DB column |
|
||||
|
||||
### What Changes (MODIFIED)
|
||||
|
||||
| Component | Location | Change |
|-----------|----------|--------|
| `VoiceRecordButton` | `ui/src/components/VoiceRecordButton.tsx` | Add silence detection, waveform data emission, auto-send on silence |
| `ChatInput` | `ui/src/components/ChatInput.tsx` | Wire new VoiceMicButton, add voice mode prop |
| `ChatMessage` | `ui/src/components/ChatMessage.tsx` | Show voice_mode badge, show dual output collapse/expand |
| `nexusSettingsSchema` | `server/src/services/nexus-settings.ts` | Add `voiceMode` enum and `telegramToken` optional string |
| `app.ts` | `server/src/app.ts` | Register `voiceRoutes`, `telegramRoutes` |
| `createMessageSchema` | `packages/shared/src/validators/chat.ts` | Add `voiceMode: z.boolean().optional()` flag on messages |
| `ChatMessage` type | `packages/shared/src/types/chat.ts` | Add `voiceMode: boolean \| null` field |
| `chat-files.ts` | `server/src/routes/chat-files.ts` | Remove `/transcribe` handler (moved to voice.ts) |
|
||||
|
||||
### What Is New (NEW)
|
||||
|
||||
| Component | Location | Purpose |
|-----------|----------|---------|
| `voicePipelineService` | `server/src/services/voice-pipeline.ts` | Transport-agnostic STT/TTS core — used by web routes AND Telegram bridge |
| `voice.ts` (route) | `server/src/routes/voice.ts` | `POST /api/transcribe`, `POST /api/synthesize` — thin HTTP wrappers |
| `telegram.ts` (service) | `server/src/services/telegram.ts` | grammY bot init, long-poll loop, message relay, voice relay |
| `telegram.ts` (route) | `server/src/routes/telegram.ts` | `GET /api/telegram/status`, `POST /api/telegram/token` management endpoints |
| `VoiceMicButton` | `ui/src/components/VoiceMicButton.tsx` | Enhanced mic button with silence detection and waveform display |
| `WaveformDisplay` | `ui/src/components/WaveformDisplay.tsx` | Animated audio waveform bars using AnalyserNode |
| `VoiceModeToggle` | `ui/src/components/VoiceModeToggle.tsx` | Three-state toggle: text only / voice input / full voice |
| `useVoiceMode` | `ui/src/hooks/useVoiceMode.ts` | Reads/writes voice mode setting via `/api/nexus-settings` |
| `useSilenceDetection` | `ui/src/hooks/useSilenceDetection.ts` | Web Audio API AnalyserNode watching for a 1.5s silence threshold |
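
The silence detection behind `useSilenceDetection` could be sketched roughly as follows. This is a minimal illustration, not the hook itself: it assumes each frame is a `Float32Array` of time-domain samples (for example from `AnalyserNode.getFloatTimeDomainData()`), and the names `rms` and `createSilenceTracker`, the `0.01` RMS threshold, and the rule that auto-send only fires after speech has been heard are all assumptions. Only the 1.5s window comes from the table above.

```typescript
// Root-mean-square level of one frame of time-domain samples in [-1, 1].
export function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Hypothetical tracker: reports true once the input has stayed below
// `thresholdRms` for `silenceMs` milliseconds after some speech was heard.
export function createSilenceTracker(thresholdRms = 0.01, silenceMs = 1500) {
  let silentSince: number | null = null;
  let heardSpeech = false;

  return {
    // Feed one frame together with its timestamp in milliseconds.
    push(samples: Float32Array, nowMs: number): boolean {
      if (rms(samples) >= thresholdRms) {
        heardSpeech = true;
        silentSince = null; // speech resets the silence window
        return false;
      }
      if (!heardSpeech) return false; // assumption: never auto-send an empty recording
      if (silentSince === null) silentSince = nowMs;
      return nowMs - silentSince >= silenceMs;
    },
  };
}
```

In a React hook, `push` would typically be called from a `requestAnimationFrame` loop, and a `true` result would stop the recorder and trigger the auto-send.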
|
||||
|
||||
---
|
||||
|
||||
## Component Boundaries
|
||||
|
||||
### voicePipelineService (Core)
|
||||
|
||||
This is the key abstraction for v1.6. Both the web HTTP route and the Telegram bridge call this service — neither knows about the other.
|
||||
|
||||
| Method | Input | Output | Implementation |
|--------|-------|--------|----------------|
| `transcribe(buffer, format)` | `Buffer`, `"webm" or "ogg"` | `Promise<string>` | Writes temp file, uses `execFile` (not `exec`) to spawn `whisper-cpp` or `whisper` CLI, reads stdout, cleans up |
| `synthesize(text, voiceId?)` | `string`, optional voiceId | `Promise<Buffer>` | Spawns `piper` CLI via `spawn`, pipes text to stdin, collects WAV stdout |
| `formatForVoice(text)` | `string` | `{ voice: string; full: string }` | Strips code blocks and markdown for voice; returns both variants |
|
||||
|
||||
The `transcribe` method extends the existing `/transcribe` implementation from `chat-files.ts` by adding an `ogg` format path alongside the existing `webm` path. The same cascade (whisper-cpp first, openai-whisper fallback) is preserved.
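
As an illustration of the `formatForVoice` row in the table above, here is a minimal sketch of what "strips code blocks and markdown" might look like. The regular expressions, the `(code omitted)` placeholder, and the `VoiceFormatted` type name are assumptions made for this example; the source only specifies the `{ voice, full }` return shape.

```typescript
export interface VoiceFormatted {
  voice: string; // plain text intended for TTS
  full: string;  // the original text, unchanged
}

// Hypothetical sketch: a few regex passes, not a full Markdown parser.
export function formatForVoice(text: string): VoiceFormatted {
  const voice = text
    // Replace fenced code blocks with a short spoken placeholder.
    .replace(/`{3}[\s\S]*?`{3}/g, " (code omitted) ")
    // Unwrap inline code spans, keeping their content.
    .replace(/`([^`]+)`/g, "$1")
    // Keep only the label of [label](url) links.
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1")
    // Drop heading markers and list bullets at the start of a line.
    .replace(/^[ \t]{0,3}(?:#{1,6}|[-*+])[ \t]+/gm, "")
    // Drop bold/italic asterisks around a word or phrase.
    .replace(/(\*\*|\*)(\S(?:.*?\S)?)\1/g, "$2")
    // Collapse runs of whitespace, including newlines.
    .replace(/\s+/g, " ")
    .trim();

  return { voice, full: text };
}
```

Returning both variants from one call lets the web UI show `full` in the collapsible block while TTS reads `voice`, without either transport re-implementing the stripping rules.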
|
||||
|
||||
**Why a dedicated service vs. inline in routes:**
|
||||
The Telegram bridge should not call the web route, since that would be a circular HTTP call within the same process. Both transports need the same logic, so extracting it into a service eliminates duplication and makes both implementations testable in isolation.
|
||||
|
||||
### telegram service
|
||||
|
||||
A thin relay, not a feature-rich bot. It:
|
||||
1. Holds a single grammY `Bot` instance, initialized when `telegramToken` is set in nexus-settings
2. Routes text messages to `chatService.addMessage()`, then collects the AI response via `puterProxyService.chatStream()`
3. Routes voice messages: downloads the OGG file, calls `voicePipelineService.transcribe()`, then follows the same text path
4. If `voiceMode === "full_voice"`: calls `voicePipelineService.synthesize()` and sends the audio back via `ctx.replyWithAudio()`. Note that `synthesize()` returns WAV while the system diagram shows an OGG reply, so a transcoding step (or a different output format) still has to be decided here
5. Prefixes the agent name on replies: `[Agent Name]: message text`
|
||||
|
||||
**No per-user conversation tracking.** All Telegram messages go to a single conversation (or create one on first use) associated with the workspace. This is the intentional "thin bridge" design — full sync is out of scope per PROJECT.md.
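
The relay steps above can be sketched independently of grammY by injecting the services as plain functions. Everything named here is hypothetical: `RelayDeps`, `relayTelegramMessage`, `askAssistant` (standing in for the `chatService` plus `puterProxyService` calls), and the `"text_only"` / `"voice_input"` mode values. Only `"full_voice"` and the `[Agent Name]: message text` prefix come from the source.

```typescript
// Only "full_voice" appears in the source; the other two values are assumed.
type VoiceMode = "text_only" | "voice_input" | "full_voice";

// Hypothetical service surface the relay depends on.
interface RelayDeps {
  transcribe(audio: Uint8Array, format: "webm" | "ogg"): Promise<string>;
  askAssistant(userText: string): Promise<{ agentName: string; text: string }>;
  synthesize(text: string): Promise<Uint8Array>;
}

interface Incoming {
  text?: string;
  voiceOgg?: Uint8Array; // already-downloaded Telegram voice note
}

type Outgoing =
  | { kind: "text"; text: string }
  | { kind: "audio"; caption: string; audio: Uint8Array };

export async function relayTelegramMessage(
  deps: RelayDeps,
  msg: Incoming,
  voiceMode: VoiceMode,
): Promise<Outgoing> {
  // Voice notes are transcribed first, then follow the same path as text.
  const userText = msg.voiceOgg
    ? await deps.transcribe(msg.voiceOgg, "ogg")
    : msg.text ?? "";

  const reply = await deps.askAssistant(userText);
  const prefixed = `[${reply.agentName}]: ${reply.text}`;

  if (voiceMode === "full_voice") {
    // Speak the unprefixed text; keep the prefixed text as the caption.
    return { kind: "audio", caption: prefixed, audio: await deps.synthesize(reply.text) };
  }
  return { kind: "text", text: prefixed };
}
```

Keeping the bot handler this thin means the grammY-specific code only has to map a `ctx` to `Incoming` and an `Outgoing` back to a reply call, which is also what makes the bridge easy to throw away later.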
|
||||
|
||||
### Voice Route vs. Chat Files Route
|
||||
|
||||
The existing `/transcribe` endpoint lives inside `chatFileRoutes` in `chat-files.ts`. For v1.6, the endpoint moves to a dedicated `voice.ts` route. This is a path-preserving refactor: the endpoint behavior is unchanged, but the code now lives in a Nexus-specific file rather than inside a mostly-upstream file.
|
||||
|
||||
Moving the handler reduces merge conflict surface on future upstream rebases of `chat-files.ts`.
|
||||
|
||||
---
|
||||
|
||||
## Recommended Project Structure
|
||||
|
||||
```
|
||||
packages/
|
||||
├── buildthis/ # NEW — npx buildthis entry point
|
||||
│ ├── src/
|
||||
│ │ └── index.ts # bin entry: detect running Nexus, open browser or run onboard
|
||||
│ └── package.json # name: "buildthis", bin: { buildthis: "./dist/index.js" }
|
||||
|
||||
server/src/
|
||||
├── services/
|
||||
│ ├── hardware.ts # NEW — detect GPU/RAM/Apple Silicon
|
||||
│ ├── puter-proxy.ts # NEW — Puter.js Node.js client wrapper
|
||||
│ ├── voice.ts # NEW — Whisper + Piper lifecycle
|
||||
│ └── memory.ts # NEW — file-based JSON assistant memory
|
||||
├── routes/
|
||||
│ ├── hardware.ts # NEW — GET /api/hardware/info
|
||||
│ ├── puter-proxy.ts # NEW — POST /api/puter-proxy/chat (SSE)
|
||||
│ ├── voice.ts # NEW — POST /api/voice/transcribe, /speak
|
||||
│ └── memory.ts # NEW — GET/POST/DELETE /companies/:id/memory
|
||||
└── app.ts # MODIFIED — mount 4 new route sets
|
||||
app.ts # MODIFY: register voiceRoutes, telegramRoutes
|
||||
routes/
|
||||
chat-files.ts # MODIFY: remove /transcribe handler (moved to voice.ts)
|
||||
voice.ts # NEW: POST /transcribe, POST /synthesize
|
||||
nexus-settings.ts # MODIFY: expose voiceMode + telegramToken fields
|
||||
telegram.ts # NEW: GET /telegram/status, POST /telegram/token
|
||||
services/
|
||||
voice-pipeline.ts # NEW: transcribe(), synthesize(), formatForVoice()
|
||||
telegram.ts # NEW: grammY bot lifecycle + relay logic
|
||||
nexus-settings.ts # MODIFY: add voiceMode + telegramToken to schema
|
||||
|
||||
ui/src/
|
||||
├── components/
|
||||
│ ├── NexusOnboardingWizard.tsx # MODIFIED — multi-step replaces single-step
|
||||
│ ├── AssistantMemoryBar.tsx # NEW
|
||||
│ └── onboarding/ # NEW directory — onboarding step components
|
||||
│ ├── ModeSelector.tsx
|
||||
│ ├── HardwareSummaryStep.tsx
|
||||
│ ├── ProviderTierStep.tsx
|
||||
│ ├── VoiceSetupStep.tsx
|
||||
│ └── OnboardingSummaryStep.tsx
|
||||
├── pages/
|
||||
│ └── PersonalAssistant.tsx # NEW — full-screen assistant page
|
||||
├── hooks/
|
||||
│ ├── useHardwareInfo.ts # NEW — query /api/hardware/info
|
||||
│ ├── usePuterChat.ts # NEW — SSE streaming from puter-proxy
|
||||
│ ├── useVoiceInput.ts # NEW — Whisper transcription hook
|
||||
│ ├── useVoiceSpeech.ts # NEW — Piper TTS hook
|
||||
│ └── useAssistantMemory.ts # NEW — memory CRUD hook
|
||||
└── api/
|
||||
├── hardware.ts # NEW — typed fetch wrappers
|
||||
├── puter-proxy.ts # NEW
|
||||
├── voice.ts # NEW
|
||||
└── memory.ts # NEW
|
||||
components/
|
||||
VoiceMicButton.tsx # NEW: replaces VoiceRecordButton in ChatInput
|
||||
WaveformDisplay.tsx # NEW: animated bars from AnalyserNode data
|
||||
VoiceModeToggle.tsx # NEW: 3-state toggle (text / voice-in / full-voice)
|
||||
VoiceRecordButton.tsx # KEEP as-is (still used in file upload contexts)
|
||||
TtsButton.tsx # MODIFY: add autoPlay prop
|
||||
ChatInput.tsx # MODIFY: add VoiceModeToggle, swap in VoiceMicButton
|
||||
ChatMessage.tsx # MODIFY: voice_mode badge + dual output expand
|
||||
hooks/
|
||||
useVoiceMode.ts # NEW: reads/writes voiceMode setting
|
||||
useSilenceDetection.ts # NEW: AnalyserNode silence threshold
|
||||
usePiperTts.ts # KEEP as-is (browser-side TTS unchanged)
|
||||
|
||||
packages/shared/src/
|
||||
validators/chat.ts # MODIFY: add voiceMode flag to createMessageSchema
|
||||
types/chat.ts # MODIFY: add voiceMode field to ChatMessage
|
||||
```
|
||||
|
||||
### Structure Rationale
|
||||
|
||||
- **`packages/buildthis/`:** Standalone package with its own `package.json` and `bin` field — publishable to npm as `buildthis` independently. Does not depend on the monorepo server package at runtime; it only detects a running Nexus instance via HTTP or launches the CLI onboard flow.
|
||||
- **`server/src/services/` additions:** All follow the existing named-export pattern (`export function hardwareService() { return { ... } }`). No classes. Dependencies injected as parameters. Drizzle `db` is only accepted if the service actually queries the DB.
|
||||
- **`ui/src/components/onboarding/`:** Sub-directory isolates the 5 new step components from the main components directory. `NexusOnboardingWizard.tsx` imports them. This limits the upstream-conflict surface to the single wizard file.
|
||||
- **`ui/src/pages/PersonalAssistant.tsx`:** New route registered in App.tsx routing (the only modification needed in the routing layer). The page re-uses `ChatPanel` with an `assistantMode` prop.
|
||||
|
||||
---
|
||||
|
||||
## Architectural Patterns
|
||||
|
||||
### Pattern 1: Hardware Detection via Server-Side Shell Probe
|
||||
### Pattern 1: Transport-Agnostic Voice Service
|
||||
|
||||
**What:** `hardwareService` runs on the Express server where it has access to `os.totalmem()` and can shell out to `system_profiler SPDisplaysDataType` on macOS to get GPU details. Apple Silicon unified memory is detected by checking the `cpu_brand_string` for "Apple M". Results are cached in memory (5-minute TTL) so the onboarding wizard can poll cheaply.
|
||||
**What:** A server service (`voicePipelineService`) owns STT and TTS logic. HTTP routes and Telegram relay both call the service — neither implements STT/TTS directly.
|
||||
|
||||
**When to use:** Any time the onboarding wizard needs to display hardware capabilities to make model recommendations.
|
||||
**When to use:** Any time two transports (web + bot) need the same capability.
|
||||
|
||||
**Trade-offs:** Server-side only — the UI cannot do this itself in the browser. The route is scoped to `assertBoard` (existing auth middleware), so it's protected. Apple Silicon reports unified memory as both RAM and VRAM; the service returns `{ unifiedMemory: true, totalBytes }` instead of separate fields.
|
||||
**Trade-offs:** Adds one indirection layer. Worth it: eliminates duplication, makes each transport testable independently.
|
||||
|
||||
**Example:**
|
||||
**Shape:**
|
||||
```typescript
|
||||
// server/src/services/hardware.ts
|
||||
export function hardwareService() {
|
||||
let cache: HardwareInfo | null = null;
|
||||
let cacheExpiry = 0;
|
||||
|
||||
return {
|
||||
async detect(): Promise<HardwareInfo> {
|
||||
if (cache && Date.now() < cacheExpiry) return cache;
|
||||
const totalBytes = os.totalmem();
|
||||
const gpuInfo = await probeGpu(); // shells system_profiler on macOS, /proc/driver/nvidia on Linux
|
||||
cache = { totalBytes, gpu: gpuInfo, platform: process.platform };
|
||||
cacheExpiry = Date.now() + 5 * 60 * 1000;
|
||||
return cache;
|
||||
}
|
||||
};
|
||||
// server/src/services/voice-pipeline.ts
|
||||
export function voicePipelineService() {
|
||||
// Uses execFile (not exec) — prevents shell injection, consistent with codebase pattern
|
||||
async function transcribe(buffer: Buffer, format: "webm" | "ogg"): Promise<string>;
|
||||
async function synthesize(text: string, voiceId?: string): Promise<Buffer>;
|
||||
function formatForVoice(text: string): { voice: string; full: string };
|
||||
return { transcribe, synthesize, formatForVoice };
|
||||
}
|
||||
```
|
||||
|
||||
### Pattern 2: Puter.js as a Server-Side Adapter (not browser-direct)
|
||||
The existing `/transcribe` handler in `chat-files.ts` already uses `promisify(execFile)` — this pattern is the right model. The service wraps it with format selection (`webm` vs `ogg`) and the same whisper-cpp → openai-whisper cascade.
|
||||
|
||||
**What:** Puter.js supports Node.js via `@heyputer/puter.js` with `init(authToken)`. The server acts as a proxy: it holds the Puter auth token (stored in `company_secrets` via the existing `secretService`), forwards chat requests to `puter.ai.chat({ stream: true })`, and pipes the async iterable back to the browser as SSE — exactly the same format the existing `useStreamingChat` hook already consumes.
|
||||
### Pattern 2: Thin Telegram Relay
|
||||
|
||||
**Why not browser-direct:** The existing chat architecture is server-mediated (all agent messages go through Express SSE). Bypassing this would require forking the streaming infrastructure. Using the server as proxy re-uses `useStreamingChat` unchanged and keeps the Puter token off the client.
|
||||
**What:** The Telegram bot is a relay, not a first-class UI. It translates Telegram message events into the same chatService calls the web UI makes, then sends the response back via Telegram.
|
||||
|
||||
**When to use:** During onboarding when user selects "Puter.js cloud" tier and authenticates. The Puter auth flow opens a browser popup (`puter.auth.signIn()` must be user-initiated from the UI), receives a token, then POSTs it to `/api/puter-proxy/auth` for server storage.
|
||||
**When to use:** Building a disposable bridge that will be replaced by a richer implementation later.
|
||||
|
||||
**Trade-offs:** One extra round-trip compared to browser-direct, but avoids token exposure and re-uses the existing SSE pipeline. Puter.js Node.js usage requires `@heyputer/puter.js` as a server dependency (not currently in the monorepo).
|
||||
**Trade-offs:** No rich UI (no inline keyboards, no threading). This is acceptable: PROJECT.md explicitly calls for a "thin bridge only" and lists "Telegram threads/topics/inline keyboards" as out of scope.
|
||||
|
||||
**Example (server-side relay):**
|
||||
**Shape:**
|
||||
```typescript
|
||||
// server/src/routes/puter-proxy.ts
|
||||
router.post("/api/puter-proxy/chat", async (req, res) => {
|
||||
const token = await svc.getStoredToken(companyId);
|
||||
const puter = init(token);
|
||||
res.setHeader("Content-Type", "text/event-stream");
|
||||
const stream = await puter.ai.chat(req.body.messages, { stream: true });
|
||||
for await (const chunk of stream) {
|
||||
res.write(`data: ${JSON.stringify(chunk)}\n\n`);
|
||||
}
|
||||
res.end();
|
||||
// server/src/services/telegram.ts
|
||||
import { Bot } from "grammy";
|
||||
|
||||
export function telegramService(db: Db) {
|
||||
let bot: Bot | null = null;
|
||||
|
||||
function start(token: string): void; // idempotent, long-poll
|
||||
function stop(): void;
|
||||
function isRunning(): boolean;
|
||||
|
||||
return { start, stop, isRunning };
|
||||
}
|
||||
```
|
||||
|
||||
The bot calls `chatService(db)` and `puterProxyService(db)` directly — no HTTP round-trip to the same server.
|
||||
|
||||
### Pattern 3: Voice Mode Flag on Messages
|
||||
|
||||
**What:** Each message carries an optional `voiceMode: boolean` flag. When `true`, the server formats the response for voice (dual output: `voice` + `full`), and the client auto-plays TTS and shows the full text in a collapsible block.
|
||||
|
||||
**When to use:** Differentiating voice-initiated messages from text messages within the same conversation.
|
||||
|
||||
**Trade-offs:** Adds a field to `createMessageSchema` and the `ChatMessage` type. The field is optional and is treated as `false` when absent, so existing messages and the upstream schema are not broken.
|
||||
|
||||
**Schema change:**
|
||||
```typescript
|
||||
// packages/shared/src/validators/chat.ts — additive only
|
||||
export const createMessageSchema = z.object({
|
||||
role: z.enum(["user", "assistant", "system"]),
|
||||
content: z.string().min(1).max(100_000),
|
||||
agentId: z.string().uuid().optional(),
|
||||
messageType: z.string().optional(),
|
||||
voiceMode: z.boolean().optional(), // NEW in v1.6
|
||||
});
|
||||
```
|
||||
|
||||
### Pattern 3: Whisper on Server, Piper in Browser (Hybrid Voice)
|
||||
### Pattern 4: Direct Service Calls in Telegram Bridge
|
||||
|
||||
**What:** Voice input (speech-to-text) runs server-side via `whisper-node` (Node.js bindings for whisper.cpp). The UI records audio via `MediaRecorder`, POSTs a blob to `POST /api/voice/transcribe`, and gets back a transcript string. Voice output (text-to-speech) uses `@mintplex-labs/piper-tts-web` which runs client-side via WebAssembly — no server round-trip needed for TTS.
|
||||
**What:** The Telegram bot does not call the Express HTTP API to get AI responses. It calls `chatService(db)` and `puterProxyService(db)` as regular TypeScript function calls within the same server process.
|
||||
|
||||
**Why this split:** whisper.cpp requires native binaries that work on CPU-only hardware, which the server controls. Piper TTS web runs via WASM in the browser and has no native dependency — this keeps TTS latency low (no network round-trip) and works even if the server is slow.
|
||||
**When to use:** Any time a server-side integration needs the same AI response capability as the web UI without an HTTP round-trip.
|
||||
|
||||
**When to use:** When user selects "voice mode" in onboarding (VoiceSetupStep). Whisper runs only if the user chooses a local Whisper model (downloaded to `data/whisper-models/`); as a fallback, the browser's native `webkitSpeechRecognition` / `SpeechRecognition` API is used.
|
||||
**Trade-offs:** Telegram handler and web handler share the same in-process service instances. If chatService has connection pooling issues, both paths are affected. This is acceptable — single-user deployment, same DB connection pool.
|
||||
|
||||
**Trade-offs:** Whisper download adds 75MB–1.5GB to first-run setup. For CPU-only hardware, whisper-tiny.en (75MB) transcribes in ~2s for a 10s clip on M4 — acceptable. Piper WASM download is ~20MB (models ~30-100MB each).
|
||||
**Why not HTTP:** A `fetch("http://localhost:PORT/api/...")` call from within the same server requires auth token injection and port discovery, and it creates circular request chains that are hard to test and fragile in development.
|
||||
|
||||
**Example (voice input hook):**
|
||||
```typescript
|
||||
// ui/src/hooks/useVoiceInput.ts
|
||||
export function useVoiceInput() {
|
||||
// Records with MediaRecorder → blob → POST /api/voice/transcribe
|
||||
// Falls back to window.SpeechRecognition if whisper not configured
|
||||
}
|
||||
```
|
||||
### Pattern 5: grammY Long-Poll for Single-User Local Deployment
|
||||
|
||||
### Pattern 4: Persistent Memory via File-Backed JSON (No New DB Table)
|
||||
**What:** Use grammY `bot.start()` (long polling) rather than webhooks. The bot polls Telegram for new messages continuously while the server is running.
|
||||
|
||||
**What:** The assistant memory store is a per-workspace JSON file at `data/memory/<companyId>.json`. Each memory entry has `{ id, content, createdAt, tags }`. The `memoryService` reads this file on startup (lazy-loaded per companyId), keeps it in-process, and writes on mutation. Memory injection works by prepending a formatted memory block to the system prompt at chat-send time in the existing chat service.
|
||||
**When to use:** Local single-user deployments where a public HTTPS endpoint is not available. No reverse proxy needed, no SSL cert, no domain.
|
||||
|
||||
**Why not PostgreSQL:** Adding a new table violates the "no DB schema changes" constraint for upstream rebase safety. File-backed JSON with an in-process cache is fast for a single-user setup (sub-millisecond reads) and requires no migration.
|
||||
**Trade-offs:** Long polling is slightly less efficient than webhooks (Telegram must respond to each poll request) but functionally equivalent for <5,000 messages/hour. Fine for personal use.
|
||||
|
||||
**When to use:** Personal AI Assistant mode only. Project Builder mode does not use the memory service.
|
||||
|
||||
**Trade-offs:** Not transactional. For a single-user local deployment, this is acceptable. File writes are atomic via write-then-rename pattern. Memory search is linear scan (no vector embeddings in v1.5 — semantic search is a future enhancement).
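
A minimal sketch of the write-then-rename and linear-scan behaviour described above, assuming a POSIX filesystem where `rename` replaces the target in one step. The names `writeStoreAtomic`, `searchMemory`, and `memoryFilePath` are hypothetical and not taken from the codebase. The synchronous `fs` calls keep the example short; an async variant using `fs/promises` would have the same shape.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Entry shape taken from the fields named in the text: { id, content, createdAt, tags }.
interface MemoryEntry {
  id: string;
  content: string;
  createdAt: string;
  tags: string[];
}

interface MemoryStore {
  entries: MemoryEntry[];
}

// Hypothetical helper: data/memory/<companyId>.json under a given data directory.
function memoryFilePath(dataDir: string, companyId: string): string {
  return path.join(dataDir, "memory", `${companyId}.json`);
}

// Write to a temp file next to the target, then rename over it, so a reader
// never sees a half-written JSON file (assumption: same filesystem, POSIX rename).
function writeStoreAtomic(filePath: string, store: MemoryStore): void {
  fs.mkdirSync(path.dirname(filePath), { recursive: true });
  const tmpPath = `${filePath}.${process.pid}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(store, null, 2), "utf8");
  fs.renameSync(tmpPath, filePath);
}

// Linear scan: case-insensitive substring match on content and tags.
function searchMemory(store: MemoryStore, query: string): MemoryEntry[] {
  const q = query.toLowerCase();
  return store.entries.filter(
    (e) =>
      e.content.toLowerCase().includes(q) ||
      e.tags.some((t) => t.toLowerCase().includes(q)),
  );
}
```

The temp file lives in the same directory as the target on purpose: a rename across filesystems is not a single-step replace.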
|
||||
|
||||
**Example:**
|
||||
```typescript
|
||||
// server/src/services/memory.ts
|
||||
export function memoryService() {
|
||||
const cache = new Map<string, MemoryStore>();
|
||||
|
||||
return {
|
||||
async inject(companyId: string, systemPrompt: string): Promise<string> {
|
||||
const store = await load(companyId);
|
||||
if (store.entries.length === 0) return systemPrompt;
|
||||
const block = store.entries.map(e => `- ${e.content}`).join("\n");
|
||||
return `${systemPrompt}\n\n## What I remember about you:\n${block}`;
|
||||
}
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
### Pattern 5: Onboarding State via instance_settings.general JSONB
|
||||
|
||||
**What:** All onboarding configuration (selected mode, voice config, active provider tier) is stored in the existing `instance_settings.general` JSONB column. The `instanceSettingsService` already handles arbitrary JSONB keys. Nexus adds its config under a `nexus` namespace key to avoid upstream key collisions.
|
||||
|
||||
**When to use:** Reading/writing onboarding mode, voice model selection, and provider tier configuration. No new table, no migration.
|
||||
|
||||
**Example:**
|
||||
```typescript
|
||||
// instance_settings.general.nexus = {
|
||||
// mode: "personal_ai" | "project_builder" | "both",
|
||||
// voiceModel: "whisper-tiny.en" | "whisper-base" | null,
|
||||
// piperVoice: "en_US-amy-medium" | null,
|
||||
// providerTier: "local" | "puter" | "oauth_gemini" | "api_key",
|
||||
// }
|
||||
```
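
One way to read and write the `nexus` namespace without disturbing other keys in `general` is sketched here. The function names and the default values are assumptions for illustration; only the field names and literal values come from the shape shown above.

```typescript
type NexusMode = "personal_ai" | "project_builder" | "both";
type ProviderTier = "local" | "puter" | "oauth_gemini" | "api_key";

interface NexusConfig {
  mode: NexusMode;
  voiceModel: "whisper-tiny.en" | "whisper-base" | null;
  piperVoice: string | null;
  providerTier: ProviderTier;
}

// Assumed defaults for a fresh install; the real defaults may differ.
const DEFAULT_NEXUS_CONFIG: NexusConfig = {
  mode: "both",
  voiceModel: null,
  piperVoice: null,
  providerTier: "local",
};

type GeneralSettings = Record<string, unknown>;

// Read general.nexus, falling back to defaults for missing keys.
function readNexusConfig(general: GeneralSettings | null | undefined): NexusConfig {
  const raw = general?.nexus;
  if (typeof raw !== "object" || raw === null) return { ...DEFAULT_NEXUS_CONFIG };
  return { ...DEFAULT_NEXUS_CONFIG, ...(raw as Partial<NexusConfig>) };
}

// Return a new `general` object with the patch merged under `nexus`;
// every other top-level key is copied through untouched.
function writeNexusConfig(
  general: GeneralSettings | null | undefined,
  patch: Partial<NexusConfig>,
): GeneralSettings {
  return { ...(general ?? {}), nexus: { ...readNexusConfig(general), ...patch } };
}
```

Keeping the merge in one place means upstream keys in `general` survive every Nexus write.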
|
||||
|
||||
### Pattern 6: OAuth Token Storage via Existing Secrets Service
|
||||
|
||||
**What:** OAuth tokens (Google Gemini, OpenAI) and the Puter.js auth token are stored via the existing `secretService` using the `local_encrypted` provider. The onboarding wizard calls `POST /api/companies/:id/secrets` with a well-known name (e.g., `nexus_puter_token`, `nexus_gemini_token`). Adapters read these at spawn time.
|
||||
|
||||
**When to use:** Any time an OAuth flow completes and a token needs persistence.
|
||||
|
||||
**Trade-offs:** Secrets are per-company (workspace), not per-instance. This is fine for single-user setup. The existing secrets UI lets users view/rotate tokens manually.
|
||||
**Lifecycle:**
|
||||
- Start: `nexusSettingsService().get()` finds `telegramToken` set → `telegramService(db).start(token)`
|
||||
- Stop: `server.close()` → `telegramService(db).stop()`
|
||||
- Runtime toggle: `POST /api/telegram/token` updates nexus-settings and calls start/stop
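
The lifecycle above can be sketched as a small state holder that starts, stops, or restarts the bot whenever the stored token changes. `BotLike` is a local stand-in interface rather than the grammY type (a grammY `Bot` would satisfy it, since its `start()` and `stop()` return promises), and `createTelegramLifecycle` is a hypothetical name.

```typescript
// Minimal surface the lifecycle needs from a bot implementation.
interface BotLike {
  start(): void | Promise<void>;
  stop(): void | Promise<void>;
}

type LifecycleResult = "started" | "stopped" | "unchanged";

function createTelegramLifecycle(makeBot: (token: string) => BotLike) {
  let bot: BotLike | null = null;
  let activeToken: string | null = null;

  return {
    // Call on server boot and after every settings update.
    apply(token: string | undefined): LifecycleResult {
      const next = token && token.trim() !== "" ? token.trim() : null;
      if (next === activeToken) return "unchanged";

      // Token changed or was removed: stop the old bot first.
      if (bot) {
        void bot.stop();
        bot = null;
      }
      activeToken = next;
      if (next === null) return "stopped";

      bot = makeBot(next);
      // Long polling runs in the background; do not await it here.
      void bot.start();
      return "started";
    },
    isRunning(): boolean {
      return bot !== null;
    },
  };
}
```

Routing boot, shutdown, and the runtime toggle through one `apply` call keeps a second polling loop from being started for the same token.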
|
||||
|
||||
---
|
||||
|
||||
## Data Flow
|
||||
|
||||
### Onboarding Wizard Data Flow
|
||||
### Web Voice Input Flow
|
||||
|
||||
```
|
||||
User opens Nexus (no workspace yet)
|
||||
↓
|
||||
NexusOnboardingWizard renders
|
||||
↓
|
||||
Step 1: ModeSelector → user picks "Personal AI" / "Project Builder" / "Both"
|
||||
↓
|
||||
Step 2: HardwareSummaryStep
|
||||
→ GET /api/hardware/info (new route)
|
||||
→ hardwareService.detect() → os.totalmem() + system_profiler
|
||||
→ returns { totalGb, gpuName, unifiedMemory, platform }
|
||||
→ wizard shows model tier recommendations
|
||||
↓
|
||||
Step 3: ProviderTierStep
|
||||
→ Local: already detected via existing Hermes probe
|
||||
→ Puter.js: user clicks "Connect" → puter.auth.signIn() popup
|
||||
→ UI POSTs token to POST /api/puter-proxy/auth
|
||||
→ server stores in secretService("nexus_puter_token")
|
||||
→ OAuth (Gemini/OpenAI): OAuth PKCE flow in a popup window
|
||||
→ callback captured by temp local server or redirect
|
||||
→ token stored via secretService
|
||||
→ API Key: direct input → stored via secretService
|
||||
↓
|
||||
Step 4: VoiceSetupStep (optional, skippable)
|
||||
→ GET /api/voice/status → check if whisper binary present
|
||||
→ User picks model → POST /api/voice/download (async download + SSE progress)
|
||||
→ User picks Piper voice → stored in instance_settings.general.nexus
|
||||
↓
|
||||
Step 5: OnboardingSummaryStep
|
||||
→ Creates workspace + agents (existing companiesApi + agentsApi flow)
|
||||
→ Writes nexus config to instance_settings.general.nexus
|
||||
→ Navigates to PersonalAssistant page OR Dashboard based on mode
|
||||
User holds mic button
|
||||
|
|
||||
v
|
||||
VoiceMicButton: MediaRecorder + AnalyserNode
|
||||
|
|
||||
v (silence detected after 1.5s or stop pressed)
|
||||
POST /api/transcribe {audio: webm blob}
|
||||
|
|
||||
v
|
||||
voice.ts route -> voicePipelineService.transcribe(buffer, "webm")
|
||||
|
|
||||
v (whisper-cpp or openai-whisper CLI via execFile)
|
||||
{ text: "transcribed text" }
|
||||
|
|
||||
v
|
||||
ChatInput fills textarea -> user sends (message tagged voiceMode: true)
|
||||
|
|
||||
v
|
||||
POST /conversations/:id/stream -> chatService + puterProxyService
|
||||
|
|
||||
v (SSE tokens arrive)
|
||||
ChatMessage with voice_mode badge + dual output (voice text + full text collapsible)
|
||||
|
|
||||
v
|
||||
TtsButton auto-plays (browser-side piper-tts-web WASM — unchanged from v1.5)
|
||||
```
|
||||
|
||||
### Personal AI Assistant Chat Data Flow
|
||||
### Server-Side TTS Flow (POST /synthesize)
|
||||
|
||||
```
|
||||
User types message in AssistantChatHub
|
||||
↓
|
||||
ChatInput → useStreamingChat.startStream(conversationId, message)
|
||||
↓
|
||||
POST /api/companies/:id/chat/conversations/:convId/messages
|
||||
↓ (existing chat route, no change)
|
||||
chat route detects "personal_assistant" agent type
|
||||
↓
|
||||
memoryService.inject(companyId, systemPrompt) ← NEW injection point
|
||||
↓
|
||||
Route selects provider based on instance_settings.general.nexus.providerTier:
|
||||
- "local" → existing Hermes adapter (no change)
|
||||
- "puter" → puterProxyService.chat() → Puter.js Node client → SSE relay
|
||||
- "oauth_*" → respective provider API with stored OAuth token → SSE relay
|
||||
↓
|
||||
SSE events stream to UI via existing /api/chat/stream endpoint pattern
|
||||
↓
|
||||
useStreamingChat receives chunks → ChatMessageList renders them
|
||||
POST /api/synthesize { text, voiceId? }
|
||||
|
|
||||
v
|
||||
voice.ts route -> voicePipelineService.synthesize(text)
|
||||
|
|
||||
v (piper CLI via spawn: text -> stdin, WAV bytes <- stdout)
|
||||
Response: Content-Type audio/wav, Buffer body
|
||||
|
|
||||
v
|
||||
Client: new Audio(URL.createObjectURL(blob)).play()
|
||||
```
|
||||
|
||||
### Voice Input Data Flow
|
||||
Note: Server-side `/synthesize` is new in v1.6. Its primary consumer is the Telegram bridge (which cannot use browser WASM). Web chat continues using browser-side `usePiperTts` WASM (v1.5 unchanged). The route is available for headless/server scenarios going forward.
|
||||
|
||||
### Telegram Text Message Flow
|
||||
|
||||
```
|
||||
User presses mic button in ChatInput (MODIFIED)
|
||||
↓
|
||||
useVoiceInput starts MediaRecorder → records WebM/Opus blob
|
||||
↓
|
||||
User releases mic → blob POSTed to POST /api/voice/transcribe
|
||||
↓
|
||||
voiceService.transcribe(audioBuffer)
|
||||
→ whisper-node.transcribe(path) → returns text
|
||||
↓
|
||||
Text injected into ChatInput.value
|
||||
↓
|
||||
User reviews → sends normally
|
||||
Telegram user sends text
|
||||
|
|
||||
v
|
||||
grammY bot.on("message:text") handler
|
||||
|
|
||||
v
|
||||
telegramService: resolveOrCreateConversation(db)
|
||||
|
|
||||
v
|
||||
chatService(db).addMessage(conversationId, { role: "user", content: text })
|
||||
|
|
||||
v
|
||||
telegramService: collect full response via puterProxyService(db).chatStream()
|
||||
|
|
||||
v (if voiceMode !== "full_voice")
|
||||
ctx.reply("[AgentName]: full_response_text")
|
||||
|
||||
| (if voiceMode === "full_voice")
|
||||
v
|
||||
voicePipelineService.formatForVoice(response) -> { voice, full }
|
||||
ctx.reply("[AgentName]: " + full) -- text message with full details
|
||||
|
|
||||
v
|
||||
voicePipelineService.synthesize(voice) -> WAV Buffer
|
||||
ctx.replyWithAudio(new InputFile(wavBuffer, "reply.wav"))
|
||||
```
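
The `formatForVoice` step in the flow above is not specified in detail. A rough post-processing sketch, assuming the voice variant is produced by stripping Markdown from the full response rather than by a second LLM pass, might look like this. The regexes cover only common Markdown and are illustrative, not exhaustive.

```typescript
interface DualOutput {
  voice: string; // plain prose suitable for TTS
  full: string;  // original response, Markdown intact
}

function formatForVoice(response: string): DualOutput {
  const voice = response
    .replace(/```[\s\S]*?```/g, " ")            // drop fenced code blocks entirely
    .replace(/`([^`]*)`/g, "$1")                // inline code: keep the text
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, "$1")   // images: keep alt text
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1")    // links: keep the label
    .replace(/^\s{0,3}#{1,6}\s+/gm, "")         // heading markers
    .replace(/^\s*[-*+]\s+/gm, "")              // bullet markers
    .replace(/(\*\*|\*)(.+?)\1/g, "$2")         // bold / italic asterisks
    .replace(/\s+/g, " ")                       // collapse whitespace
    .trim();
  return { voice, full: response };
}
```

Underscore emphasis is deliberately left alone so identifiers such as `snake_case` are not mangled. Stray asterisks used as multiplication signs would still be affected, so this is a starting point only.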
|
||||
|
||||
### npx buildthis Data Flow
|
||||
### Telegram Voice Message Flow
|
||||
|
||||
```
|
||||
Developer runs: npx buildthis
|
||||
↓
|
||||
buildthis/src/index.ts checks for running Nexus:
|
||||
GET http://localhost:4000/api/health → 200?
|
||||
YES → open browser to http://localhost:4000
|
||||
NO → run nexus onboard wizard (delegates to paperclipai onboard)
|
||||
OR detect Docker → suggest docker-compose up
|
||||
Telegram user sends voice note (OGG Opus format)
|
||||
|
|
||||
v
|
||||
grammY bot.on("message:voice") -> ctx.getFile() -> download Buffer
|
||||
|
|
||||
v
|
||||
voicePipelineService.transcribe(buffer, "ogg") -> whisper CLI -> text
|
||||
|
|
||||
v
|
||||
(same path as Telegram text message above)
|
||||
```
|
||||
|
||||
---
|
||||
### nexus-settings Schema Evolution
|
||||
|
||||
## Integration Points: New vs Modified
|
||||
```
|
||||
v1.5: { mode, voiceEnabled }
|
||||
v1.6: { mode, voiceEnabled, voiceMode, telegramToken }
|
||||
|
||||
### Server Routes — app.ts (MODIFIED)
|
||||
|
||||
One file to add 4 route mounts. Minimal conflict surface with upstream.
|
||||
|
||||
```typescript
|
||||
// In server/src/app.ts — add after ollamaRoutes():
|
||||
app.use(hardwareRoutes());
|
||||
app.use(voiceRoutes());
|
||||
app.use(memoryRoutes(db));
|
||||
app.use(puterProxyRoutes(db));
|
||||
voiceMode: "text" | "voice_input" | "full_voice" (default: "text")
|
||||
telegramToken: string | undefined (set by user via UI or POST /telegram/token)
|
||||
```
|
||||
|
||||
### Chat Route — MODIFIED for memory injection
|
||||
|
||||
The existing chat service (`server/src/services/chat.ts`) needs one injection point: when building the system prompt for a conversation, call `memoryService.inject()`. This is scoped to conversations where the agent has `adapterConfig.assistantMode === true`.
|
||||
|
||||
**Risk:** This touches an upstream file. The injection is a 3-line addition inside the message-send handler. Low conflict probability — upstream rarely modifies this section.
|
||||
|
||||
### NexusOnboardingWizard.tsx — REPLACED
|
||||
|
||||
The current single-step wizard becomes a multi-step wizard. Since this file is already a Nexus replacement (not an upstream file), there is zero conflict risk — it will never exist in upstream.
|
||||
|
||||
### App.tsx routing — MODIFIED (one new route)
|
||||
|
||||
Add the `PersonalAssistant` page as a new lazy-loaded route. Minimal upstream conflict (routing section rarely changes).
|
||||
|
||||
### ChatInput.tsx — MODIFIED (voice button)
|
||||
|
||||
Add a microphone button that triggers `useVoiceInput`. This is an upstream file — the modification is additive (new button, no existing logic changed). Conflict risk: LOW, as upstream rarely modifies ChatInput.
|
||||
|
||||
---
|
||||
|
||||
## Anti-Patterns
|
||||
|
||||
### Anti-Pattern 1: Browser-Direct Puter.js
|
||||
|
||||
**What people do:** Import `@heyputer/puter.js` in the React frontend and call `puter.ai.chat()` directly from the browser.
|
||||
|
||||
**Why it's wrong:** Exposes the Puter auth token in browser storage/network. Bypasses the existing SSE pipeline, requiring a second streaming implementation. Breaks the memory injection pattern (no server-side hook). Cannot use the existing `useStreamingChat` hook.
|
||||
|
||||
**Do this instead:** Use the server proxy pattern (Pattern 2). The UI sends messages to `/api/puter-proxy/chat` exactly like any other chat endpoint.
|
||||
|
||||
### Anti-Pattern 2: New PostgreSQL Tables for Memory
|
||||
|
||||
**What people do:** Create an `assistant_memories` migration with a proper relational schema.
|
||||
|
||||
**Why it's wrong:** Violates the hard constraint: no DB migrations, no schema changes, to keep upstream rebase clean. A migration file created in Nexus will conflict every time upstream adds a migration.
|
||||
|
||||
**Do this instead:** File-backed JSON in the server's data directory (Pattern 4). The single-user M4 Mini deployment will never hit performance limits with this approach.
|
||||
|
||||
### Anti-Pattern 3: Multi-Step Wizard as Modified OnboardingWizard.tsx
|
||||
|
||||
**What people do:** Modify the upstream `OnboardingWizard.tsx` directly to add v1.5 steps.
|
||||
|
||||
**Why it's wrong:** The upstream wizard is actively maintained (120+ upstream commits since fork). Touching it creates guaranteed rebase conflicts.
|
||||
|
||||
**Do this instead:** Continue the existing pattern — `NexusOnboardingWizard.tsx` is already the Nexus replacement via Vite alias. All v1.5 changes go there. Upstream file untouched.
|
||||
|
||||
### Anti-Pattern 4: OAuth in the Browser via Redirect
|
||||
|
||||
**What people do:** Redirect the main app window to the OAuth provider and handle the callback via `window.location`.
|
||||
|
||||
**Why it's wrong:** Loses React state mid-flow. Hard to handle callback URL in a local server that may not have a publicly routable HTTPS endpoint.
|
||||
|
||||
**Do this instead:** Use a popup window for OAuth (`window.open`). The popup handles the full OAuth redirect. On callback, the popup calls `window.opener.postMessage` with the token, closes itself, and the main window receives it. For Puter.js specifically, `puter.auth.signIn()` handles the popup internally.
|
||||
`voiceMode` is a workspace-level setting (not per-agent). The three states map to:
|
||||
- `"text"`: mic button transcribes to text input, TTS manual-only, Telegram text-only
|
||||
- `"voice_input"`: mic transcribes and auto-sends, TTS manual-only, Telegram voice-in + text-out
|
||||
- `"full_voice"`: mic auto-sends, TTS auto-plays on every response, Telegram voice-in + voice-out
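
The three states can be captured as a lookup table so that the web UI and the Telegram handler read the same flags. This is a sketch; `VoiceBehavior`, its field names, and `resolveVoiceMode` are hypothetical, and only the mode names and the default (`"text"`) come from the document.

```typescript
type VoiceMode = "text" | "voice_input" | "full_voice";

interface VoiceBehavior {
  micAutoSend: boolean;      // web: send immediately after transcription
  ttsAutoPlay: boolean;      // web: auto-play TTS on every response
  telegramVoiceIn: boolean;  // Telegram: accept voice notes
  telegramVoiceOut: boolean; // Telegram: reply with synthesized audio
}

const VOICE_BEHAVIOR: Record<VoiceMode, VoiceBehavior> = {
  text:        { micAutoSend: false, ttsAutoPlay: false, telegramVoiceIn: false, telegramVoiceOut: false },
  voice_input: { micAutoSend: true,  ttsAutoPlay: false, telegramVoiceIn: true,  telegramVoiceOut: false },
  full_voice:  { micAutoSend: true,  ttsAutoPlay: true,  telegramVoiceIn: true,  telegramVoiceOut: true },
};

// Unknown or missing values fall back to the documented default, "text".
function resolveVoiceMode(value: unknown): VoiceMode {
  return value === "voice_input" || value === "full_voice" ? value : "text";
}

function behaviorFor(value: unknown): VoiceBehavior {
  return VOICE_BEHAVIOR[resolveVoiceMode(value)];
}
```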
|
||||
|
||||
---
|
||||
|
||||
## Scaling Considerations
|
||||
|
||||
This is a single-user local deployment on an M4 Mini. Scaling is not a concern for v1.5. The architecture is designed for correctness and upstream merge-ability, not horizontal scale.
|
||||
This system targets a single user on Mac Mini M4 throughout its lifetime. Scaling is not a concern. The architecture is optimized for simplicity and upstream merge compatibility.
|
||||
|
||||
| Concern | Single User (M4 Mini) |
|
||||
|---------|----------------------|
|
||||
| Hardware detection | os.totalmem() + sync shell probe, cached 5min — negligible |
|
||||
| Puter.js relay | One connection at a time, no pooling needed |
|
||||
| Whisper transcription | ~2s for 10s clip on M4, sequential queue sufficient |
|
||||
| Memory store | File JSON, <10ms read, no contention |
|
||||
| Voice TTS | WASM in browser, zero server load |
|
||||
| Concern | At 1 user (target) | Notes |
|
||||
|---------|-------------------|-------|
|
||||
| STT latency | whisper-cpp base.en on M4: ~1-3s | Acceptable; shows transcribing spinner |
|
||||
| TTS latency | piper CLI on M4: ~0.3-1s for short text | <3s target met |
|
||||
| Telegram poll | grammY `bot.start()`, 1 process | Adequate for <5,000 msgs/hour |
|
||||
| Memory overhead | ~10-20MB for polling loop | Acceptable on 16GB+ M4 |
|
||||
| Piper model | First server-side synthesize: cold start | Piper loads model into memory; subsequent calls fast |
|
||||
|
||||
---
|
||||
|
||||
## Build Order (Dependency Graph)
|
||||
## Anti-Patterns
|
||||
|
||||
The build order matters because later phases consume services built in earlier ones.
|
||||
### Anti-Pattern 1: Telegram-Specific Voice Logic
|
||||
|
||||
```
|
||||
Phase 1: Hardware Detection
|
||||
→ hardwareService (server)
|
||||
→ GET /api/hardware/info (route)
|
||||
→ useHardwareInfo hook (UI)
|
||||
→ HardwareSummaryStep component (UI)
|
||||
No dependencies on other new phases.
|
||||
**What people do:** Implement OGG-to-text and text-to-OGG directly inside the Telegram bot handler.
|
||||
|
||||
Phase 2: Provider Tiers (depends on Phase 1 for display)
|
||||
→ puterProxyService (server) — Puter.js Node client
|
||||
→ secretService integration for token storage (uses EXISTING service)
|
||||
→ POST /api/puter-proxy/auth (route)
|
||||
→ ProviderTierStep component (UI)
|
||||
→ OAuth popup flow (UI)
|
||||
**Why it's wrong:** Creates two separate STT/TTS code paths that diverge over time. Voice bugs must be fixed in two places. Untestable in isolation.
|
||||
|
||||
Phase 3: Multi-Step Onboarding Wizard (depends on Phases 1+2)
|
||||
→ ModeSelector, OnboardingSummaryStep components (UI)
|
||||
→ Refactor NexusOnboardingWizard.tsx into multi-step
|
||||
→ instance_settings.general.nexus config write
|
||||
**Do this instead:** All voice processing goes through `voicePipelineService`. The Telegram handler calls `transcribe(buf, "ogg")` — the service handles format differences. The web route calls `transcribe(buf, "webm")` — same service, different format argument.
|
||||
|
||||
Phase 4: Persistent Memory + Assistant Mode (depends on Phase 3)
|
||||
→ memoryService (server)
|
||||
→ Memory injection in chat route (MODIFIED — highest risk step)
|
||||
→ GET/POST/DELETE /api/companies/:id/memory (routes)
|
||||
→ PersonalAssistantPage (UI)
|
||||
→ useAssistantMemory hook (UI)
|
||||
### Anti-Pattern 2: Circular HTTP Call for Telegram AI Response
|
||||
|
||||
Phase 5: Voice (depends on Phase 3, independent of Phase 4)
|
||||
→ voiceService (server) — whisper-node + piper setup
|
||||
→ POST /api/voice/transcribe, /speak, /status (routes)
|
||||
→ VoiceSetupStep in onboarding (UI)
|
||||
→ useVoiceInput, useVoiceSpeech hooks (UI)
|
||||
→ ChatInput microphone button (MODIFIED — upstream file, low risk)
|
||||
**What people do:** Telegram bot handler calls `fetch("http://localhost:PORT/api/conversations/:id/stream")` to get AI responses from within the same server process.
|
||||
|
||||
Phase 6: npx buildthis (independent of all above)
|
||||
→ packages/buildthis/ new package
|
||||
→ package.json bin field setup
|
||||
→ npm publish configuration
|
||||
```
|
||||
**Why it's wrong:** Requires auth token injection. Fragile (port discovery). Extra TCP round-trip. Fails in test environments where the HTTP server may not be running.
|
||||
|
||||
**Recommended sequence:** 1 → 2 → 3 → 4 → 5 → 6. Phase 4 (memory injection into chat route) is the highest-risk upstream-file modification and should come after onboarding is validated.
|
||||
**Do this instead:** `telegramService` imports `chatService(db)` and `puterProxyService(db)` directly. Collect tokens from the async generator into a string, then send to Telegram as a single message.
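
A sketch of the "collect tokens into a string" step, assuming the chat stream is exposed as an `AsyncIterable<string>`. The exact return type of `puterProxyService(db).chatStream()` is not shown in this document, and `collectReply` is a hypothetical helper.

```typescript
// Drain a token stream into one string.
async function collectStream(tokens: AsyncIterable<string>): Promise<string> {
  let text = "";
  for await (const token of tokens) {
    text += token;
  }
  return text;
}

// Build the single Telegram message, using the "[AgentName]: ..." prefix
// shown in the text-message flow.
async function collectReply(
  agentName: string,
  tokens: AsyncIterable<string>,
): Promise<string> {
  const body = await collectStream(tokens);
  return `[${agentName}]: ${body}`;
}
```

Because the helper takes any async iterable, it can be unit-tested with a plain async generator and no HTTP server or bot running.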
|
||||
|
||||
### Anti-Pattern 3: Blocking grammY on Slow CLI Processes
|
||||
|
||||
**What people do:** `await synthesize()` inside a bot handler with no timeout, assuming piper is always available and fast.
|
||||
|
||||
**Why it's wrong:** If the `piper` binary is not installed or hangs, the grammY update queue stalls. The same update gets retried indefinitely.
|
||||
|
||||
**Do this instead:** Wrap CLI calls in a `Promise.race([piperCall, timeout(8_000)])`. If piper times out or is not installed, fall back to text-only reply and log the failure. Bot degrades gracefully to text mode.
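
The timeout-and-fallback idea can be sketched as follows. `withTimeout` and `synthesizeOrFallback` are hypothetical helpers; the `synthesize` argument stands in for whatever function wraps the piper call.

```typescript
// Reject if `work` does not settle within `ms`; always clear the timer.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

type VoiceReply =
  | { kind: "voice"; audio: Uint8Array }
  | { kind: "text"; reason: string };

// Try TTS; on timeout or any error, degrade to a text-only reply.
async function synthesizeOrFallback(
  synthesize: () => Promise<Uint8Array>,
  ms = 8_000,
): Promise<VoiceReply> {
  try {
    const audio = await withTimeout(synthesize(), ms);
    return { kind: "voice", audio };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    console.warn(`TTS unavailable, replying with text only: ${reason}`);
    return { kind: "text", reason };
  }
}
```

Note that `Promise.race` only stops waiting. It does not terminate a hung child process, so the real `synthesize` wrapper would still need to kill the piper process on timeout.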
|
||||
|
||||
### Anti-Pattern 4: Keeping /transcribe Inside chat-files.ts
|
||||
|
||||
**What people do:** Leave the STT handler in `chat-files.ts` and call `voicePipelineService` from there, adding Nexus-specific logic to an upstream-sourced file.
|
||||
|
||||
**Why it's wrong:** `chat-files.ts` is a mostly-upstream Paperclip file. Each rebase introduces merge conflicts. More Nexus-specific code in the file = more conflict surface.
|
||||
|
||||
**Do this instead:** Move `/transcribe` and `/synthesize` to a new `voice.ts` route file (Nexus-only, never in upstream). Keep `chat-files.ts` as close to upstream as possible.
|
||||
|
||||
### Anti-Pattern 5: Storing Telegram Token in Database
|
||||
|
||||
**What people do:** Create a new DB table or add a column to `instance_settings` to store the Telegram bot token.
|
||||
|
||||
**Why it's wrong:** Any DB schema change blocks upstream rebase (migration files conflict). The `nexus-settings.json` file-backed service is the established Nexus pattern for project-specific config that has no upstream equivalent.
|
||||
|
||||
**Do this instead:** Store `telegramToken` in `nexus-settings.json` via the existing `nexusSettingsService`. Same pattern as `voiceEnabled`, `mode`.
|
||||
|
||||
---
|
||||
|
||||
## Integration Points: External Services
|
||||
## Integration Points
|
||||
|
||||
| Service | Integration Pattern | Auth Storage | Notes |
|
||||
|---------|---------------------|--------------|-------|
|
||||
| Puter.js | Server-side Node.js client proxy, SSE relay | `company_secrets` table | Token obtained via browser popup on first connect |
|
||||
| Google Gemini OAuth | PKCE popup flow, access token + refresh token | `company_secrets` table | Policy risk: using Gemini CLI OAuth with third-party apps may trigger abuse detection — use only if user has an active Gemini subscription |
|
||||
| OpenAI OAuth | PKCE flow via auth.openai.com | `company_secrets` table | Only for free tier / ChatGPT Plus users |
|
||||
| Whisper (whisper-node) | Native binary, spawned by voiceService | N/A — local binary | Download on first use, cached in data/whisper-models/ |
|
||||
| Piper TTS | @mintplex-labs/piper-tts-web WASM, runs in browser | N/A — client-side | Model files downloaded to browser cache |
|
||||
| Ollama | Existing integration (v1.4) — no changes | N/A | ollama.ts service and /ollama routes unchanged |
|
||||
### External Services
|
||||
|
||||
| Service | Integration Pattern | Notes |
|
||||
|---------|---------------------|-------|
|
||||
| Telegram Bot API | grammY `bot.start()` long-polling (Node.js) | No public URL required; polling starts on server boot if token present in nexus-settings |
|
||||
| whisper-cpp / openai-whisper | `execFile` cascade (same as existing `/transcribe`) | Format argument added: writes `.webm` or `.ogg` temp file based on input |
|
||||
| piper TTS binary | `child_process.spawn` stdin -> stdout | Text piped to stdin; WAV or raw PCM bytes collected from stdout |
|
||||
|
||||
### Internal Boundaries
|
||||
|
||||
| Boundary | Communication | Notes |
|
||||
|----------|---------------|-------|
|
||||
| voice route <-> voicePipelineService | Direct function call | Route is thin HTTP wrapper; all logic in service |
|
||||
| telegram service <-> voicePipelineService | Direct function call | Same service used by both transports |
|
||||
| telegram service <-> chatService | Direct function call | Bot calls `chatService(db)` directly — no HTTP round-trip |
|
||||
| telegram service <-> nexusSettingsService | Direct function call | Reads `voiceMode` and `telegramToken` at start and on each message |
|
||||
| web UI <-> voice route | REST: `POST /api/transcribe`, `POST /api/synthesize` | Web client uses browser-side piper WASM for TTS; `/synthesize` primarily for Telegram |
|
||||
| UI VoiceModeToggle <-> nexus-settings | REST: `PATCH /api/nexus-settings` | Reads/writes `voiceMode` setting |
|
||||
|
||||
---
|
||||
|
||||
## Build Order
|
||||
|
||||
Based on component dependencies, the recommended build order within this milestone:
|
||||
|
||||
| Step | Component(s) | Reason |
|
||||
|------|-------------|--------|
|
||||
| 1 | `nexus-settings` schema extensions (`voiceMode`, `telegramToken`) | Everything downstream reads settings |
|
||||
| 2 | `voicePipelineService` | Backs all voice. No new deps. Independently testable. |
|
||||
| 3 | `voice.ts` route (`POST /transcribe`, `POST /synthesize`) | Thin wrapper. Register in `app.ts`. Move handler from chat-files. |
|
||||
| 4 | `VoiceMicButton` + `WaveformDisplay` + `useSilenceDetection` | Pure UI. Depends only on `/transcribe`. |
|
||||
| 5 | `VoiceModeToggle` + `useVoiceMode` | Depends on `voiceMode` in nexus-settings schema (Step 1). |
|
||||
| 6 | `ChatMessage` dual output | Depends on `voiceMode` in shared `ChatMessage` type. |
|
||||
| 7 | `createMessageSchema` + `ChatMessage` type (`voiceMode` flag) | Shared package change. Required by Steps 5-6. Could move earlier. |
|
||||
| 8 | `telegramService` | Depends on voicePipelineService (2), chatService (existing), nexusSettings (1). |
|
||||
| 9 | `telegram.ts` route + app.ts registration | Management endpoints. Needs telegramService. |
|
||||
| 10 | Onboarding STT/TTS hardware detection step | Final: wires all voice detection into onboarding flow. |
|
||||
|
||||
Steps 4-6 can run in parallel with Steps 7-9 if split across phases, as long as the shared `ChatMessage` type change (Step 7) lands before Step 6, which depends on it.
|
||||
|
||||
---
|
||||
|
||||
## Sources
|
||||
|
||||
- Codebase inspection: `/opt/nexus/server/src/`, `/opt/nexus/ui/src/`, `/opt/nexus/packages/`
|
||||
- Puter.js Node.js support: https://docs.puter.com/supported-platforms/
|
||||
- Puter.js chat streaming API: https://docs.puter.com/AI/chat/
|
||||
- Puter.js auth flow: https://developer.puter.com/blog/browser-based-auth-puter-js-node/
|
||||
- whisper-node npm package: https://www.npmjs.com/package/whisper-node
|
||||
- Piper TTS WASM: https://www.npmjs.com/package/@mintplex-labs/piper-tts-web
|
||||
- @xenova/transformers Node.js audio guide: https://huggingface.co/docs/transformers.js/main/en/guides/node-audio-processing
|
||||
- Google Gemini OAuth: https://ai.google.dev/gemini-api/docs/oauth
|
||||
- Google Gemini OAuth policy risk: https://github.com/google-gemini/gemini-cli/issues/21866
|
||||
- Vectra local vector DB (future memory enhancement): https://github.com/Stevenic/vectra
|
||||
- Apple Silicon unified memory: https://eclecticlight.co/2022/03/01/making-sense-of-m1-memory-use/
|
||||
- Direct codebase inspection: `server/src/routes/chat-files.ts` (lines 297-386), `server/src/routes/chat.ts`, `server/src/services/nexus-settings.ts`, `server/src/app.ts`, `ui/src/components/VoiceRecordButton.tsx`, `ui/src/components/TtsButton.tsx`, `ui/src/hooks/usePiperTts.ts`, `packages/shared/src/validators/chat.ts`, `packages/shared/src/types/chat.ts`
|
||||
- `.planning/STATE.md` — v1.6 architectural decisions (transport-agnostic, disposable bridge, dual output, per-message flag)
|
||||
- `.planning/milestones/v1.5-phases/34-voice/34-RESEARCH.md` — existing voice implementation details, WASM TTS pattern
|
||||
- [grammY documentation](https://grammy.dev/) — TypeScript-native, Bot API 9.6 (April 2026), long-polling vs webhooks
|
||||
- [grammY deployment types guide](https://grammy.dev/guide/deployment-types) — long polling recommended for single-user local; Express integration pattern
|
||||
- [rhasspy/piper (archived)](https://github.com/rhasspy/piper) — CLI: `echo "text" | piper --model voice.onnx -f -`; development moved to OHF-Voice/piper1-gpl Oct 2025
|
||||
- grammY supports Telegram Bot API 9.6 (released April 3, 2026) — latest version confirmed
|
||||
|
||||
---
|
||||
*Architecture research for: Nexus v1.5 Smart Onboarding + Personal AI Assistant*
|
||||
*Researched: 2026-04-02*
|
||||
*Architecture research for: Voice Pipeline + Minimal Telegram Bridge (v1.6)*
|
||||
*Researched: 2026-04-03*
|
||||
|
|
|
|||
|
|
@ -1,22 +1,30 @@
|
|||
# Feature Research
|
||||
|
||||
**Domain:** Smart Onboarding + Personal AI Assistant (Nexus v1.5)
|
||||
**Researched:** 2026-04-02
|
||||
**Confidence:** MEDIUM overall — Puter.js confirmed current, hardware detection patterns confirmed, personal AI assistant patterns from active ecosystem; UX recommendations inferred from patterns
|
||||
**Domain:** Voice Pipeline (Whisper STT + Piper TTS) + Telegram Bridge (Nexus v1.6)
|
||||
**Researched:** 2026-04-03
|
||||
**Confidence:** MEDIUM-HIGH — STT/TTS pipeline patterns are well-documented; Telegram bot API is stable; dual-output formatting and voice mode UX patterns inferred from ChatGPT/Meta AI voice implementations and community patterns
|
||||
|
||||
---
|
||||
|
||||
## Milestone Scope
|
||||
|
||||
This document covers only the NEW features in v1.5. Existing features (NexusOnboardingWizard, Hermes adapter, Ollama integration, chat interface, PWA, voice input via Whisper) are already built and are dependencies, not deliverables.
|
||||
This document covers only the NEW features in v1.6. The following are already built and are dependencies, not deliverables:
|
||||
|
||||
- VoiceRecordButton with MediaRecorder API in ChatInput (v1.3)
|
||||
- TtsButton with @mintplex-labs/piper-tts-web WASM synthesis (v1.3/v1.5)
|
||||
- POST /transcribe endpoint with whisper-cpp/openai-whisper cascade (v1.3)
|
||||
- VoiceStep in onboarding wizard (v1.5)
|
||||
- voiceEnabled in nexus-settings (v1.5)
|
||||
- Full chat system with streaming SSE (v1.3)
|
||||
|
||||
**New features being researched:**
|
||||
- Hardware detection with pre-built model database
|
||||
- Tiered provider setup: local (Ollama) → zero-config cloud (Puter.js) → OAuth cloud (Gemini, OpenAI) → API key / subscription (Hermes, Claude Code, OpenClaw)
|
||||
- Personal AI Assistant mode with persistent memory, MCP connections, voice (Whisper + Piper)
|
||||
- Project handoff: assistant conversation → PM agent with context transfer
|
||||
- `npx buildthis` CLI entry point
|
||||
- Every step skippable
|
||||
- Transport-agnostic voice pipeline (server-side, not just browser WASM)
|
||||
- Voice mode flag on messages (affects response formatting)
|
||||
- Dual output pattern: voice-optimized prose + full markdown text
|
||||
- Web chat voice UI improvements: silence detection, waveform, auto-submit
|
||||
- Web chat audio playback: inline player, auto-play toggle
|
||||
- Voice mode toggle setting (text only / voice input / full voice)
|
||||
- Minimal Telegram bridge: single bot, text + voice relay, agent prefixing
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -24,130 +32,131 @@ This document covers only the NEW features in v1.5. Existing features (NexusOnbo
|
|||
|
||||
### Table Stakes (Users Expect These)
|
||||
|
||||
Features users assume exist in a modern AI onboarding flow. Missing these makes onboarding feel broken or untrustworthy.
|
||||
Features users assume exist when voice or Telegram is mentioned. Missing these makes the feature feel broken or incomplete.
|
||||
|
||||
| Feature | Why Expected | Complexity | Notes |
|
||||
|---------|--------------|------------|-------|
|
||||
| Hardware auto-detection on first run | Any local AI tool probes GPU/RAM; users expect "it just knows" | MEDIUM | Node.js can read `/proc/meminfo`, spawn `nvidia-smi`, detect Apple Silicon via `os.arch()`; Ollama's `/api/tags` endpoint also reveals loaded models |
|
||||
| RAM-aware model recommendations | Ollama and LM Studio both do this; users have been trained to expect it | LOW | Pre-built lookup table: <8GB RAM → 3B-7B, 8-16GB → 7B-13B, 16GB+ → 30B+; VRAM takes priority over system RAM |
|
||||
| Step-skippable onboarding | Any wizard that forces completion feels hostile; Clerk, Vercel, and Postman all allow skip | LOW | Each step needs a "skip" or "set up later" affordance; final summary shows what was skipped |
|
||||
| Progress indicator | Multi-step wizards without progress indicators cause anxiety ("how many more steps?") | LOW | Step counter or progress bar; 5-7 max steps total |
|
||||
| Summary screen before entering app | Users need to understand what was set up before being dropped in the dashboard | LOW | Show: mode selected, provider configured, models available; "Start chatting" CTA |
|
||||
| "Test connection" before saving | Every API key entry form should validate before proceeding | LOW | Quick `/health` or echo call to configured provider; show latency |
|
||||
| Persisted onboarding state | Refreshing mid-wizard should not restart from step 1 | LOW | LocalStorage or DB; existing NexusOnboardingWizard already handles this pattern |
|
||||
| Voice input/output toggle | Users who selected voice features expect them to work immediately | MEDIUM | Whisper already exists (v1.3); Piper TTS is the new addition; toggle in assistant settings |
|
||||
| Persistent conversation memory | Any "personal AI assistant" product ships some form of memory (ChatGPT, Claude Projects, Gemini) | HIGH | Users compare against ChatGPT memory; table stakes for the mode to feel meaningful |
|
||||
| MCP-style external connections | Power users expect the assistant to connect to their tools (files, git, search) | MEDIUM | MCP is now a universal standard (Anthropic, OpenAI, Google all adopted it); STDIO and HTTP transport both needed |
|
||||
| Silence-based auto-submit | Every voice input UI (Siri, Google, Whisper demos) stops recording on silence; holding a button feels archaic | MEDIUM | WebRTC VAD or AudioWorklet amplitude monitoring; 1.5s silence threshold typical; must show countdown so user knows what's happening |
|
||||
| Waveform/amplitude visualization while recording | Users expect visual feedback that the mic is active; a static "recording..." text feels broken | LOW | Canvas or SVG with 30-50 data points; AnalyserNode from Web Audio API; real-time amplitude bars, not pre-rendered waveform |
|
||||
| Voice response auto-play toggle | If the AI responded with audio, playing it automatically is expected unless the user disabled it; manual play-only feels incomplete | LOW | Boolean setting in nexus-settings (voiceAutoPlay); inline HTML5 `<audio>` element is sufficient; Web Audio API not needed |
|
||||
| Markdown-free voice responses | Users who hear responses read aloud expect prose sentences, not "asterisk asterisk bold asterisk asterisk code block triple backtick" spoken aloud | MEDIUM | Requires voice mode flag on the message sent to LLM; system prompt addendum: "respond in natural spoken prose, no markdown symbols, no bullet points, no code blocks unless the user explicitly asks"; dual output requires separate LLM pass or post-processing strip |
|
||||
| Telegram text relay to existing chat | Sending a text message to the Telegram bot and receiving the agent's reply is the core use case; anything less is not a bridge | MEDIUM | Telegraf (Node.js) as bot framework; message forwarded to existing chat API endpoint; response prefixed with agent name |
|
||||
| Telegram voice message transcription | Telegram users frequently send voice notes; a bridge that ignores voice messages frustrates mobile users immediately | MEDIUM | Telegram sends voice as OGG/Opus; download → convert (ffmpeg) → POST /transcribe → forward text to agent → reply with text (+ optionally TTS audio back) |
|
||||
| Agent identity visible in Telegram replies | When multiple agents can respond, the user must know who is replying | LOW | Simple text prefix: `[Hermes] Your answer here`; consistent format across all messages |
|
||||
| Recording state visible in UI | Users must be able to tell when recording is active vs. idle vs. processing | LOW | Three states in mic button: idle (mic icon), recording (red pulsing), processing (spinner); state machine pattern |
|
||||
|
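The silence-based auto-submit row above can be sketched as a small state machine fed with amplitude frames. This is a minimal sketch, not the project's implementation: `SilenceDetector`, `rmsLevel`, and the `0.02` amplitude threshold are hypothetical names and values; only the 1.5s silence window comes from the table. In the browser the frames would come from an `AnalyserNode` or AudioWorklet, which is left out here so the logic stays testable.

```typescript
// Hypothetical helper: names and the amplitude threshold are illustrative assumptions.
interface SilenceDetectorOptions {
  silenceMs: number;    // how long input must stay quiet before auto-submit
  rmsThreshold: number; // amplitude (0..1) below which a frame counts as silent
}

// Root-mean-square level of one block of PCM samples in the range -1..1.
function rmsLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) sumSquares += samples[i] * samples[i];
  return Math.sqrt(sumSquares / samples.length);
}

class SilenceDetector {
  private opts: SilenceDetectorOptions;
  private heardSpeech = false;
  private silenceStartedAt: number | null = null;

  constructor(opts: SilenceDetectorOptions = { silenceMs: 1500, rmsThreshold: 0.02 }) {
    this.opts = opts;
  }

  // Feed one amplitude frame; returns true once the recording should auto-submit.
  push(rms: number, nowMs: number): boolean {
    if (rms >= this.opts.rmsThreshold) {
      this.heardSpeech = true;
      this.silenceStartedAt = null; // speech resets the silence window
      return false;
    }
    if (!this.heardSpeech) return false; // ignore silence before the user starts talking
    if (this.silenceStartedAt === null) this.silenceStartedAt = nowMs;
    return nowMs - this.silenceStartedAt >= this.opts.silenceMs;
  }

  // Milliseconds left before auto-submit, for a countdown UI; null while speaking or idle.
  remainingMs(nowMs: number): number | null {
    if (this.silenceStartedAt === null) return null;
    return Math.max(0, this.opts.silenceMs - (nowMs - this.silenceStartedAt));
  }
}
```

Ignoring silence until speech has been heard avoids submitting an empty recording when the user taps the mic and pauses before talking.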
||||
### Differentiators (Competitive Advantage)
|
||||
|
||||
Features that make Nexus v1.5 worth using over ChatGPT, Claude Projects, or bare Ollama.
|
||||
Features that make v1.6's voice and Telegram features worth using, beyond baseline functionality.
|
||||
|
||||
| Feature | Value Proposition | Complexity | Notes |
|
||||
|---------|-------------------|------------|-------|
|
||||
| Puter.js as zero-config cloud tier | No API key, no sign-up, 500+ models including GPT-4.1, Claude Sonnet 4, Gemini 2.5 — user pays via their Puter account | MEDIUM | Puter uses a "user-pays" model: each user authenticates against Puter and consumes their own credits. Developer (Mikkel) pays nothing. Implementation: drop in `puter.js` script, call `puter.ai.chat()`; requires user to have/create a Puter account (free tier exists) |
|
||||
| Local-first framed as privacy premium | Most tools push cloud. Nexus frames local Ollama as the privacy-respecting choice, not the budget option | LOW | Copy/UX decision: "Your data never leaves your machine" for local tier. No code change needed |
|
||||
| Hardware detection → instant model recommendation | Instead of listing 100 models and asking the user to pick, Nexus says "Given your M4 Mac Mini with 16GB unified memory, we recommend llama3.2:3b for assistant tasks" | MEDIUM | Pre-built model database (JSON lookup): Apple Silicon tiers, NVIDIA VRAM tiers, AMD VRAM tiers, CPU-only tier. Cross-reference with Ollama model library metadata |
|
||||
| Project handoff from assistant to PM agent | "Turn this conversation into a project" — one button to create a Paperclip Project with issues extracted from the conversation, with full chat context transferred to PM agent | HIGH | Novel UX pattern; no off-the-shelf solution; requires: summary extraction from conversation (LLM call), Project entity creation via existing API, agent prompt injection with context summary |
|
||||
| `npx buildthis` CLI entry point | Zero-install UX: `npx buildthis` downloads and runs the Nexus server + opens browser. Same pattern as `create-react-app`, `shadcn`, etc. | MEDIUM | Commander.js CLI already exists; `npx` entry requires: `bin` field in package.json, published to npm (or private registry), auto-open browser after server starts |
|
||||
| Voice + local LLM = fully offline assistant | Whisper (STT) + Piper TTS + Ollama (LLM) = zero cloud dependency for voice interaction. Rare in consumer tools | HIGH | Piper is CPU-capable, fast enough on Apple Silicon. Integration complexity: audio pipeline (mic → Whisper → Ollama → Piper → speaker); streaming TTS for lower latency |
|
||||
| Mode selection: Personal AI / Project Builder / Both | Most tools are either a chat assistant or a project manager. Nexus surfaces both modes with explicit switching | LOW | UI mode toggle stored in workspace settings; affects which features are surfaced in sidebar/dashboard |
|
||||
| Google OAuth cloud tier (no API key) | Users with Google accounts can use Gemini without managing API keys — mirrors how Opencode handles Gemini OAuth | MEDIUM | Google OAuth flow → exchange for short-lived AI Studio token; already proven pattern in Opencode |
|
||||
| Transport-agnostic voice pipeline | Voice processing works identically for browser input, Telegram voice notes, and future CLI/API callers; no duplication of Whisper/Piper logic | MEDIUM | Abstract to a `VoicePipelineService`: `transcribe(audioBuffer) → text`, `synthesize(text, voice?) → audioBuffer`; HTTP endpoints call the service; Telegram bot calls the same service |
|
||||
| Dual output pattern | AI responds with two representations: short spoken-prose version (for TTS/Telegram) and full markdown version (for web chat, copy-paste, code); user sees both where appropriate | HIGH | Prompt engineering: "Provide a SPOKEN response (1-3 sentences, no markdown) and a DETAILED response (full markdown). Format: SPOKEN: … DETAILED: …"; parse and split in middleware; store both in message metadata |
|
||||
| Sentence-buffered TTS streaming | Start playing the first sentence while the second is still synthesizing; reduces perceived latency vs. waiting for full response | MEDIUM | Split response on `.!?`; Piper synthesizes sentence 1, audio starts playing; meanwhile sentence 2 begins synthesis; append chunks to audio queue |
|
||||
| Voice mode flag preserves context | Messages tagged with `voice_mode: true` in the DB let the UI, Telegram bridge, and future Command Center all render correctly without re-inferring intent | LOW | Add a `source` field or `voice_mode` boolean to message metadata; the existing message schema likely supports a metadata/extras column |
|
||||
| Telegram as thin relay (not a separate chat product) | The Telegram bot forwards to the existing Nexus chat engine; responses use the full agent intelligence already configured; no separate bot personality to maintain | LOW | Relay pattern: Telegram message → POST /api/workspaces/:id/chat/messages → SSE stream → collect full response → reply to Telegram; agent prefixing is presentation only |
|
||||
| Language auto-detection in STT | Whisper natively detects language without configuration; relay this info back to the UI so the user knows what language was detected | LOW | Whisper returns `language` in its JSON output; pass through to transcript response; log in message metadata; no user config needed for common languages |
|
||||
|
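The sentence-buffered TTS row above depends on splitting a response on `.!?` before handing each piece to Piper. A minimal sketch of that split follows; `splitSentences` is a hypothetical name, and the regex is deliberately naive (an assumption, not the project's tokenizer).

```typescript
// Hypothetical helper: splits a response into sentences for sentence-by-sentence synthesis.
// Naive by design: it also splits inside decimals ("3.5") and abbreviations ("e.g."),
// which a production splitter would need to handle.
function splitSentences(text: string): string[] {
  // One or more non-terminator characters, followed by terminators or end of input.
  const parts = text.match(/[^.!?]+(?:[.!?]+|$)/g) ?? [];
  return parts.map((part) => part.trim()).filter((part) => part.length > 0);
}
```

Each returned sentence could then be synthesized in order, with playback of sentence 1 starting while sentence 2 is still being synthesized.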
||||
### Anti-Features (Commonly Requested, Often Problematic)
|
||||
|
||||
Features that seem like good additions but create maintenance debt, scope creep, or user confusion.
|
||||
|
||||
| Feature | Why Requested | Why Problematic | Alternative |
|
||||
|---------|---------------|-----------------|-------------|
|
||||
| "Sync memory to cloud" for personal assistant | Users want memory accessible across devices | Requires auth system, cloud storage, privacy policy, GDPR compliance — enormous scope for a personal tool | Local SQLite memory is sufficient for Mac Mini single-user; defer cloud sync to a future milestone |
|
||||
| Automatic MCP server discovery | Users want zero-config MCP like Bluetooth discovery | MCP servers expose arbitrary capabilities; auto-discovery without user approval is a security risk | Curated list of common MCP servers (filesystem, git, web search) with one-click add; user approves each |
|
||||
| Real-time provider cost display during chat | Visible per-message token cost feels responsive | Puter.js explicitly does not expose cost to developer (user-pays model); cost calculation would require hardcoding token prices that drift | Show estimated costs for API-key providers only; for Puter.js, show "costs charged to your Puter account" |
|
||||
| Streaming TTS (word-by-word) | Reduces perceived latency of voice responses | Browser audio API makes true word-by-word streaming complex; sentence-by-sentence is the practical optimum | Buffer by sentence (split on `.!?`); start playing first sentence while next is synthesizing |
|
||||
| Multi-user onboarding / team setup | Looks natural to "extend" to teams | Nexus is intentionally single-user (Mac Mini, local_trusted mode); team features require auth overhaul | Explicitly document single-user scope; defer team features until upstream Paperclip ships them |
|
||||
| AI provider auto-negotiation (pick best available) | Transparent provider switching sounds smart | Silent model switches confuse users ("why did my assistant suddenly get dumber?"); debugging becomes impossible | Show active provider in UI always; let user set preferred priority order; never switch silently |
|
||||
| Real-time speech-to-speech streaming | Feels like a "next level" voice experience | Requires full-duplex WebSocket audio, interrupt handling, turn-taking logic, VAD on both ends — an entirely different architecture (Pipecat, LiveKit); out of scope for a relay bridge | Sequential pipeline (speak → wait → hear) is sufficient for assistant use cases; real-time is only needed for phone-call-style interaction |
|
||||
| Per-agent Telegram bots | "My PM agent should have its own bot handle" | Multiple bots mean multiple bot tokens, multiple webhook registrations, and complex routing when agents hand off to each other; a maintenance nightmare | Single bot with agent name prefix in messages: `[PM] Here is your sprint plan`; PROJECT.md explicitly places per-agent bots out of scope |
|
||||
| Deep Telegram ↔ web chat sync | "I want to see Telegram messages in the web UI" | Real-time bidirectional sync requires a shared event bus (Postgres LISTEN/NOTIFY or Redis pub/sub), session management across transports, and conflict resolution; PROJECT.md explicitly defers this to "Postgres bus" future milestone | Relay is one-way per session: Telegram message → agent → Telegram reply; web chat is a separate session |
|
||||
| Wake word detection | "Hey Nexus, start recording" | Requires always-on microphone access, local wakeword model (Porcupine, OpenWakeWord), and careful battery/privacy handling; browser does not allow always-on mic | Mic button tap is sufficient; wake word is a future hardware device concern |
|
||||
| Streaming TTS word-by-word | Feels maximally responsive | Browser audio playback of a stream of tiny WAV fragments causes clicks, gaps, and buffering issues; each Piper call has startup overhead; the sentence-buffered approach gives 95% of the benefit | Sentence-buffered playback (buffer on `.!?`); start playing sentence 1 while sentence 2 synthesizes |
|
||||
| Inline code execution over Telegram | "I want to run tasks from Telegram" | Security: arbitrary code execution via an unauthenticated chat interface; scope: Telegram bridge is explicitly a thin relay, not a command interface | Support text and voice message relay only; task creation via conversational agent response is sufficient |
|
||||
| GSD formatting / rich elements in Telegram | Telegram supports inline keyboards, threaded replies; use them | Telegram's formatting model (inline keyboards, callback queries) requires stateful session tracking; PROJECT.md explicitly places this out of scope | Plain text plus Telegram's legacy `Markdown` parse mode (renders bold/italic/code natively); no inline keyboards in v1.6 |
|
||||
| Transcription editing before sending | "Let me see the transcript before it goes to the agent" | Adds a confirmation step that breaks the hands-free voice flow; most users trust auto-send after VAD silence detection; optionally show transcript as a message in the UI after the fact | Show the detected transcript in the chat message bubble with a small "mic" icon; no edit step |
|
||||
|
||||
---
|
||||
|
||||
## Feature Dependencies
|
||||
|
||||
```
|
||||
Hardware Detection
|
||||
└──feeds──> Model Recommendation DB
|
||||
└──feeds──> Local AI Setup (Ollama tier)
|
||||
Transport-Agnostic VoicePipelineService
|
||||
└──wraps──> Existing /transcribe endpoint (Whisper) [already built]
|
||||
└──wraps──> Piper TTS binary/WASM [already built in browser; server-side is new]
|
||||
└──consumed-by──> Web chat mic button (browser calls server or uses WASM directly)
|
||||
└──consumed-by──> Telegram bridge (server-side calls VoicePipelineService)
|
||||
└──consumed-by──> Future transports (CLI, API, Command Center)
|
||||
|
||||
Puter.js Integration
|
||||
└──requires──> Puter account (user-side; not a Nexus dependency)
|
||||
└──requires──> Client-side script inclusion (no server-side secrets)
|
||||
Voice Mode Flag
|
||||
└──set-by──> Web chat (user is in voice mode)
|
||||
└──set-by──> Telegram bridge (message arrived as voice note)
|
||||
└──consumed-by──> LLM prompt construction (appends no-markdown instruction)
|
||||
└──consumed-by──> Dual output pattern (triggers two-response format)
|
||||
└──consumed-by──> TTS synthesis (triggers auto-synthesis of response)
|
||||
|
||||
Personal AI Assistant Mode
|
||||
└──requires──> Mode Selection (Personal / Project Builder / Both)
|
||||
└──requires──> Persistent Memory Store (SQLite via existing DB)
|
||||
└──requires──> Existing Chat Interface (v1.3 ChatPanel) [already built]
|
||||
Dual Output Pattern
|
||||
└──requires──> Voice mode flag (only triggers in voice mode)
|
||||
└──requires──> LLM prompt engineering (structured SPOKEN/DETAILED format)
|
||||
└──produces──> Short prose (for TTS, Telegram reply)
|
||||
└──produces──> Full markdown (for web chat display, copy)
|
||||
|
||||
MCP Connections
|
||||
└──requires──> Personal AI Assistant Mode (MCP is an assistant-mode feature)
|
||||
└──requires──> STDIO transport (Node.js child_process, already available in CLI)
|
||||
Web Chat Voice UI (silence detection + waveform)
|
||||
└──requires──> Existing VoiceRecordButton [already built — enhance, not replace]
|
||||
└──requires──> Web Audio API (AnalyserNode for amplitude) [browser built-in]
|
||||
└──enhances──> Voice Mode Toggle (waveform only visible when voice mode active)
|
||||
|
||||
Voice (Piper TTS)
|
||||
└──requires──> Existing Whisper STT (v1.3) [already built]
|
||||
└──enhances──> Personal AI Assistant Mode
|
||||
Web Chat Audio Playback
|
||||
└──requires──> TTS synthesis output (WAV/MP3 audio buffer)
|
||||
└──requires──> Voice mode flag (auto-play only in full voice mode)
|
||||
└──independent──> waveform visualization (different UI component)
|
||||
|
||||
Project Handoff
|
||||
└──requires──> Personal AI Assistant Mode (conversation context exists there)
|
||||
└──requires──> Existing PM Agent Template (v1.4) [already built]
|
||||
└──requires──> Existing Project entity (upstream Paperclip) [already built]
|
||||
└──requires──> LLM summarization call (any configured provider)
|
||||
Telegram Bridge
|
||||
└──requires──> VoicePipelineService (for voice note handling)
|
||||
└──requires──> Existing chat API (POST /api/... for message relay)
|
||||
└──requires──> ffmpeg (OGG/Opus → WAV conversion for Whisper)
|
||||
└──requires──> Telegraf (Node.js bot framework)
|
||||
└──independent──> web chat UI changes
|
||||
|
||||
npx buildthis
|
||||
└──requires──> Existing CLI (Commander.js) [already built]
|
||||
└──requires──> npm publish or private registry setup
|
||||
|
||||
Google OAuth Cloud Tier
|
||||
└──requires──> OAuth flow (Google Sign-In)
|
||||
└──independent──> other provider tiers (each tier is additive)
|
||||
Onboarding STT/TTS Detection
|
||||
└──requires──> Existing VoiceStep [already built — update, not replace]
|
||||
└──requires──> VoicePipelineService availability check
|
||||
└──independent──> Telegram bridge
|
||||
```
|
||||
|
||||
### Dependency Notes
|
||||
|
||||
- **Persistent memory requires existing DB:** Paperclip already uses SQLite/Postgres; a `memory` table (key/value or embedding store) can be added. No ORM change needed if using raw SQL in a new file.
|
||||
- **MCP requires assistant mode to be active:** MCP connections are scoped to the Personal AI Assistant mode, not the Project Builder. They should not be surfaced during project management workflows.
|
||||
- **Hardware detection is a one-time onboarding concern:** Results should be cached; re-detection should be available in Settings but not re-run on every launch.
|
||||
- **Puter.js has no server-side dependency:** The entire integration is client-side JavaScript. This is both a strength (zero backend changes) and a constraint (Puter auth happens in the browser, not on the Nexus server).
|
||||
- **VoicePipelineService is the keystone:** Build this first. It abstracts Whisper + Piper behind a clean interface. Every other v1.6 feature is a consumer. If this is skipped, the Telegram bridge and web improvements become duplicate, divergent code.
|
||||
- **Voice mode flag must be stored on the message:** Not just passed in memory. Future Command Center and Telegram both need to know retroactively whether a message was voice-originated.
|
||||
- **Dual output is optional on non-voice messages:** Text-mode messages do not need the SPOKEN variant. The prompt injection and response parsing only apply when `voice_mode: true`.
|
||||
- **Telegram bridge has no UI:** It's a server-side Node.js process (or Express route). No React changes needed for Telegram.
|
||||
- **ffmpeg is a hard dependency for Telegram voice notes:** Telegram sends OGG/Opus; Whisper expects WAV/MP3. ffmpeg must be available on the server. On Mac Mini this is `brew install ffmpeg`.
|
||||
- **Web chat waveform enhances existing VoiceRecordButton:** Do not replace it. The existing component handles MediaRecorder and send; add AudioWorklet/AnalyserNode visualization on top.
|
||||
|
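The notes on the voice mode flag and on dual output being optional can be illustrated with a small prompt builder. This is a sketch under assumptions: `MessageMeta`, `buildSystemPrompt`, and the `"web" | "telegram"` source values are hypothetical; the addendum wording is taken from the table-stakes table.

```typescript
// Hypothetical metadata stored alongside each message (field names are assumptions).
interface MessageMeta {
  source: "web" | "telegram";
  voice_mode: boolean;
}

// Addendum wording follows the "Markdown-free voice responses" row.
const VOICE_ADDENDUM =
  "Respond in natural spoken prose, no markdown symbols, no bullet points, " +
  "no code blocks unless the user explicitly asks.";

// Text-mode messages keep the base prompt untouched; only voice-mode messages get the addendum.
function buildSystemPrompt(basePrompt: string, meta: MessageMeta): string {
  return meta.voice_mode ? `${basePrompt}\n\n${VOICE_ADDENDUM}` : basePrompt;
}
```

Because the decision reads only stored metadata, any transport that persists `voice_mode` gets the same prompt behavior without transport-specific branches.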
||||
---
|
||||
|
||||
## MVP Definition
|
||||
|
||||
### Launch With (v1.5 Milestone)
|
||||
### Launch With (v1.6 Milestone)
|
||||
|
||||
Minimum viable set to validate the milestone goals.
|
||||
Minimum viable set to make voice and Telegram genuinely useful, not just technically present.
|
||||
|
||||
- [ ] **Mode selection UI** — Personal AI / Project Builder / Both selector in onboarding + settings. Why essential: gates all assistant-specific features.
|
||||
- [ ] **Hardware detection + model recommendation** — Detect RAM/VRAM, recommend Ollama model. Why essential: the primary UX claim of "smart onboarding."
|
||||
- [ ] **Puter.js cloud tier** — Zero-config provider for users without local AI. Why essential: removes the "I have to install Ollama" barrier.
|
||||
- [ ] **Personal AI Assistant chat with persistent memory** — Conversations that remember previous sessions. Why essential: defines the Personal AI Assistant mode as meaningfully different from existing chat.
|
||||
- [ ] **Summary screen → straight into chat** — After onboarding completes, land in chat not dashboard. Why essential: closes the onboarding funnel.
|
||||
- [ ] **Every step skippable** — Including hardware detection, cloud setup, MCP config. Why essential: PROJECT.md explicitly requires this.
|
||||
- [ ] **Piper TTS** — Text-to-speech for assistant responses. Why essential: completes the voice loop that Whisper STT already started.
|
||||
- [ ] **VoicePipelineService** — Transport-agnostic server-side Whisper + Piper abstraction. Why essential: gates all other features; prevents code duplication between web and Telegram.
|
||||
- [ ] **Voice mode flag + dual output** — LLM receives no-markdown instruction; response splits into spoken prose + full markdown. Why essential: spoken markdown sounds broken; this is what makes TTS usable.
|
||||
- [ ] **Web chat silence detection + auto-submit** — Amplitude-based VAD stops recording automatically and submits. Why essential: hands-free voice only works if the user does not have to click "send."
|
||||
- [ ] **Web chat waveform visualization** — Amplitude bars while recording. Why essential: without it, users cannot tell if the mic is picking up audio.
|
||||
- [ ] **Web chat audio playback with auto-play toggle** — Agent voice responses play inline. Why essential: without playback, TTS synthesis has nowhere to go.
|
||||
- [ ] **Voice mode toggle setting** — Three modes: text only / voice input only / full voice (input + output). Why essential: users need to control the modality per session.
|
||||
- [ ] **Telegram text relay** — Text messages in → agent response out, with agent prefix. Why essential: core use case for phone access.
|
||||
- [ ] **Telegram voice note relay** — Voice notes in → transcribe → agent → text reply. Why essential: mobile Telegram users default to voice notes.
|
||||
|
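The three-way voice mode toggle in the list above maps naturally onto a string union. A minimal sketch follows, assuming hypothetical identifiers (`VoiceMode`, `voiceCapabilities`) and setting values that are not taken from the codebase.

```typescript
// Hypothetical setting values for the three modes named in the MVP list.
type VoiceMode = "text_only" | "voice_input" | "full_voice";

interface VoiceCapabilities {
  micEnabled: boolean;        // show the mic button and waveform
  synthesizeReplies: boolean; // run TTS on agent responses
}

// Derive UI and pipeline behavior from the single stored setting.
function voiceCapabilities(mode: VoiceMode): VoiceCapabilities {
  switch (mode) {
    case "text_only":
      return { micEnabled: false, synthesizeReplies: false };
    case "voice_input":
      return { micEnabled: true, synthesizeReplies: false };
    case "full_voice":
      return { micEnabled: true, synthesizeReplies: true };
  }
}
```

Deriving both flags from one value avoids impossible combinations, such as TTS output enabled while the voice feature as a whole is off.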
||||
### Add After Validation (v1.5.x)
|
||||
### Add After Validation (v1.6.x)
|
||||
|
||||
Features to add once core assistant mode is working.
|
||||
|
||||
- [ ] **Project handoff** — "Turn this conversation into a project" button. Trigger: assistant mode is stable and used regularly.
|
||||
- [ ] **MCP server connections** — Curated list with one-click add. Trigger: users request specific tool integrations.
|
||||
- [ ] **Google OAuth cloud tier** — Gemini without API key. Trigger: Puter.js limitations surface (rate limits, cost surprises for users).
|
||||
- [ ] **`npx buildthis` CLI entry point** — Zero-install UX. Trigger: sharing Nexus with others becomes a use case.
|
||||
- [ ] **Telegram TTS reply option** — Agent response synthesized and sent back as an OGG voice note. Trigger: user feedback that text replies are too long to read on phone.
|
||||
- [ ] **Sentence-buffered TTS streaming** — Start audio playback before full synthesis completes. Trigger: latency complaints with longer responses.
|
||||
- [ ] **Voice response history in UI** — Chat messages show audio player for past synthesized responses (not just the current one). Trigger: users want to replay previous responses.
|
||||
|
||||
### Future Consideration (v2+)
|
||||
|
||||
Features to defer until post-v1.5.
|
||||
|
||||
- [ ] **OpenAI OAuth tier** — OpenAI free tier via OAuth; rate limits are aggressive and UX is complex.
|
||||
- [ ] **Subscription/API key auto-detection** — Scan environment for `ANTHROPIC_API_KEY`, etc. Low user value vs. complexity.
|
||||
- [ ] **Memory export/import** — Portable memory across reinstalls. Needs file format design.
|
||||
- [ ] **Multi-MCP orchestration** — Parallel MCP server calls, result merging. Enterprise complexity for personal tool.
|
||||
- [ ] **Real-time speech-to-speech** — Full-duplex conversation; requires Pipecat or LiveKit; entirely different architecture.
|
||||
- [ ] **Wake word detection** — Always-on mic, local wakeword model; hardware device concern.
|
||||
- [ ] **Deep Telegram ↔ web sync** — Bidirectional session mirroring via Postgres bus; deferred per PROJECT.md.
|
||||
- [ ] **Per-transport voice models** — Different Piper voice for Telegram vs. web (e.g., cleaner phone voice vs. natural assistant voice).
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -155,163 +164,174 @@ Features to defer until post-v1.5.
|
|||
|
||||
| Feature | User Value | Implementation Cost | Priority |
|
||||
|---------|------------|---------------------|----------|
|
||||
| Mode selection UI | HIGH | LOW | P1 |
|
||||
| Hardware detection + model recommendation | HIGH | MEDIUM | P1 |
|
||||
| Puter.js zero-config cloud | HIGH | MEDIUM | P1 |
|
||||
| Persistent memory (SQLite) | HIGH | MEDIUM | P1 |
|
||||
| Summary screen → chat | HIGH | LOW | P1 |
|
||||
| Every step skippable | HIGH | LOW | P1 |
|
||||
| Piper TTS | MEDIUM | MEDIUM | P1 |
|
||||
| Project handoff | HIGH | HIGH | P2 |
|
||||
| MCP connections (curated) | MEDIUM | MEDIUM | P2 |
|
||||
| Google OAuth cloud tier | MEDIUM | MEDIUM | P2 |
|
||||
| `npx buildthis` | LOW | MEDIUM | P2 |
|
||||
| OpenAI free tier OAuth | LOW | HIGH | P3 |
|
||||
| API key auto-detection | LOW | MEDIUM | P3 |
|
||||
| VoicePipelineService | HIGH | MEDIUM | P1 |
|
||||
| Voice mode flag + dual output | HIGH | MEDIUM | P1 |
|
||||
| Silence detection + auto-submit | HIGH | MEDIUM | P1 |
|
||||
| Waveform visualization | MEDIUM | LOW | P1 |
|
||||
| Audio playback + auto-play toggle | HIGH | LOW | P1 |
|
||||
| Voice mode toggle setting | HIGH | LOW | P1 |
|
||||
| Telegram text relay | HIGH | MEDIUM | P1 |
|
||||
| Telegram voice note relay | HIGH | MEDIUM | P1 |
|
||||
| Telegram TTS reply | MEDIUM | MEDIUM | P2 |
|
||||
| Sentence-buffered TTS streaming | MEDIUM | MEDIUM | P2 |
|
||||
| Voice response history | LOW | MEDIUM | P3 |
|
||||
| Real-time speech-to-speech | HIGH | HIGH | P3 (v2+) |
|
||||
|
||||
**Priority key:**
|
||||
- P1: Must have for v1.5 launch
|
||||
- P2: Should have, add in v1.5.x
|
||||
- P1: Must have for v1.6 launch
|
||||
- P2: Should have, add in v1.6.x
|
||||
- P3: Nice to have, v2+
|
||||
|
||||
---
|
||||
|
||||
## Competitor Feature Analysis
|
||||
|
||||
| Feature | ChatGPT | Claude Projects | Bare Ollama | Nexus v1.5 Approach |
|
||||
|---------|---------|-----------------|-------------|---------------------|
|
||||
| Persistent memory | Yes (cloud) | Yes (project instructions) | No | SQLite local; no cloud required |
|
||||
| Hardware-aware setup | No | No | No | Pre-built model database; auto-recommend |
|
||||
| Zero-config cloud | No (API key) | No (API key) | N/A | Puter.js user-pays model |
|
||||
| Local/offline operation | No | No | Yes (manual) | Ollama + Piper + Whisper; fully offline |
|
||||
| Voice I/O | Yes (cloud) | No | No | Whisper STT (existing) + Piper TTS (new) |
|
||||
| Tool connections | Yes (plugins) | Yes (Projects) | No | MCP servers (curated list) |
|
||||
| Project handoff | No | Partial (copy-paste) | No | One-button conversation → PM agent |
|
||||
| Mode switching | No | No | No | Personal AI / Project Builder / Both |
|
||||
| Feature | ChatGPT Voice Mode | Telegram + other bots | Nexus v1.6 Approach |
|
||||
|---------|--------------------|-----------------------|---------------------|
|
||||
| STT | Whisper (cloud) | Per-bot (usually cloud) | Whisper local, CPU fallback |
|
||||
| TTS | Custom neural (cloud) | gTTS or ElevenLabs | Piper local, CPU-only |
|
||||
| Markdown-free voice | Yes (GPT strips markdown) | Usually not (bots send raw markdown) | Dual output: SPOKEN + DETAILED |
|
||||
| Silence detection | Yes (VAD, full-duplex) | N/A | Amplitude VAD, 1.5s threshold |
|
||||
| Waveform UI | Animated blobs (not literal waveform) | N/A | AnalyserNode amplitude bars |
|
||||
| Agent identity in replies | N/A (single assistant) | Custom per bot | Text prefix `[AgentName]` |
|
||||
| Telegram voice note support | N/A | Varies widely | OGG→WAV→Whisper→agent |
|
||||
| Offline / local operation | No | No | Fully local: Whisper + Piper + Ollama |
|
||||
| Transport abstraction | N/A | N/A | VoicePipelineService (web + Telegram share same service) |
|
||||
|
||||
---
|
||||
|
||||
## Provider Tier Architecture
|
||||
## Voice Pipeline Architecture Notes
|
||||
|
||||
The onboarding should present providers as a tiered funnel, not a flat list. Users land in the highest-comfort tier:
|
||||
**Confidence:** HIGH for the cascading/sequential pipeline; MEDIUM for dual output prompt engineering reliability.
|
||||
|
||||
### Sequential Pipeline (chosen architecture for v1.6)
|
||||
|
||||
```
|
||||
Tier 0: Already have Hermes / Claude Code / OpenClaw running
|
||||
└──detect via env vars or local port scan──> skip straight to summary
|
||||
|
||||
Tier 1: Local AI (most private, no cost)
|
||||
└──Ollama installed?──> detect models, recommend based on hardware
|
||||
└──Ollama not installed?──> show install prompt with one-liner
|
||||
|
||||
Tier 2: Zero-config cloud (easiest, user-pays)
|
||||
└──Puter.js──> "Sign in with Puter" → 500+ models, no API key
|
||||
└──User creates/logs into free Puter account
|
||||
|
||||
Tier 3: OAuth cloud (Google account required, free quota)
|
||||
└──Google Gemini──> OAuth flow → Gemini 2.0 Flash free tier
|
||||
└──Free tier as of 2026: reduced but functional (Gemini 2.0 Flash)
|
||||
|
||||
Tier 4: API key / subscription
|
||||
└──Hermes (existing)
|
||||
└──Claude Code (ANTHROPIC_API_KEY)
|
||||
└──OpenClaw (custom)
|
||||
└──OpenAI (OPENAI_API_KEY)
|
||||
[Browser/Telegram]
|
||||
|
|
||||
| audio buffer (WAV/OGG)
|
||||
v
|
||||
VoicePipelineService.transcribe()
|
||||
|
|
||||
| transcript text + language + confidence
|
||||
v
|
||||
LLM (with voice_mode prompt addendum)
|
||||
|
|
||||
| structured response: SPOKEN: "..." DETAILED: "..."
|
||||
v
|
||||
Response parser → { spoken: string, detailed: string }
|
||||
| |
|
||||
| v
|
||||
| Web chat: render detailed (markdown)
|
||||
| Telegram: send spoken as text
|
||||
v
|
||||
VoicePipelineService.synthesize(spoken)
|
||||
|
|
||||
| WAV audio buffer
|
||||
v
|
||||
Web chat: <audio> element autoplay
|
||||
Telegram (v2): sendVoice() as OGG/Opus
|
||||
```
|
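The "Response parser" step in the diagram above can be sketched as follows. This is a minimal sketch, assuming the LLM follows the `SPOKEN: … DETAILED: …` format; `parseDualOutput` and `DualOutput` are hypothetical names. Since the reliability of the prompt format is rated MEDIUM, the sketch falls back to using the raw text for both variants when the markers are missing.

```typescript
interface DualOutput {
  spoken: string;   // short prose for TTS and Telegram
  detailed: string; // full markdown for web chat
}

// Hypothetical parser for the structured SPOKEN/DETAILED response format.
function parseDualOutput(raw: string): DualOutput {
  // Lazy match for the spoken part so it stops at the first DETAILED: marker.
  const match = raw.match(/SPOKEN:\s*([\s\S]*?)\s*DETAILED:\s*([\s\S]*)/);
  if (!match) {
    // Fallback (an assumption): the model ignored the format, so reuse the raw text.
    const text = raw.trim();
    return { spoken: text, detailed: text };
  }
  return { spoken: match[1].trim(), detailed: match[2].trim() };
}
```

A caller in voice mode would pass `spoken` to synthesis and render `detailed` in the web chat; in the fallback case the spoken text may still contain markdown and might need a separate strip pass.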
||||
|
||||
**Key insight:** Users should be steered toward Tier 0 or 1 first (most private, most robust for single-user Mac Mini). Puter.js (Tier 2) is the escape hatch for users who won't install Ollama, not the default recommendation.
|
||||
### Why not real-time speech-to-speech:
|
||||
|
||||
Real-time requires full-duplex WebSocket audio, interrupt detection (barge-in), turn-taking state machine, and sub-200ms latency budgets. The sequential pattern targets <3s end-to-end on Apple Silicon M4, which is appropriate for assistant interactions (not phone calls). The complexity delta is enormous; PROJECT.md explicitly defers this.
|
||||
|
||||
---
|
||||
|
||||
## Puter.js Integration Notes
|
||||
## Telegram Bridge Architecture Notes
|
||||
|
||||
**Confidence:** MEDIUM — confirmed working from official docs, but production reliability and rate limit specifics are not publicly documented.
|
||||
**Confidence:** HIGH — Telegraf is the standard Node.js Telegram framework; patterns are well-established.
|
||||
|
||||
- Integration is entirely client-side: `<script src="https://js.puter.com/v2/"></script>` then `puter.ai.chat(model, message)`
|
||||
- Supports 500+ models including GPT-4.1, Claude Sonnet 4, Gemini 2.5 Flash, Llama 3.x
|
||||
- User authenticates against Puter (free account); developer incurs zero cost
|
||||
- Rate limits: not publicly documented; Puter says "no restrictions" but this is unverified at scale
|
||||
- Limitation: requires user to create/have a Puter account — this is friction vs. "truly zero-config"
|
||||
- Risk: Puter's pricing model is described as "still being worked out" — future cost surprises for users possible
|
||||
- Mitigation: Show clear messaging that Puter costs are the user's own account costs, not Nexus costs
|
||||
### Single Bot, Agent Prefix Pattern
|
||||
|
||||
```
|
||||
Telegram user sends: "What's the status of the Nexus project?"
|
||||
|
|
||||
Telegraf handler
|
||||
|
|
||||
POST /api/workspaces/:id/chat/messages
|
||||
{ content: "What's the status...", source: "telegram", voice_mode: false }
|
||||
|
|
||||
SSE stream → collect until [DONE]
|
||||
|
|
||||
bot.telegram.sendMessage(chatId, "[Hermes] The Nexus project is currently...")
|
||||
```
|
||||
|
||||
### Voice Note Flow
|
||||
|
||||
```
|
||||
Telegram user sends voice note (OGG/Opus, ~15s)
|
||||
|
|
||||
Telegraf voice handler: bot.telegram.getFile(fileId) → download OGG
|
||||
|
|
||||
ffmpeg: OGG → WAV (16kHz mono)
|
||||
|
|
||||
VoicePipelineService.transcribe(wavBuffer)
|
||||
|
|
||||
POST /api/workspaces/:id/chat/messages
|
||||
{ content: transcript, source: "telegram", voice_mode: true }
|
||||
|
|
||||
Collect SSE stream → spoken variant of response
|
||||
|
|
||||
bot.telegram.sendMessage(chatId, "[Hermes] " + spokenResponse)
|
||||
// v2: bot.telegram.sendVoice(chatId, { source: synthesizedOggBuffer })
|
||||
```
|
||||
|
||||
### Key implementation decisions:
|
||||
|
||||
- **Polling vs. webhooks:** Webhooks require a public HTTPS endpoint. For Mac Mini on home network, long polling is the correct choice. Telegraf supports both; use `bot.launch()` (polling mode) for v1.6.
|
||||
- **Bot token storage:** Environment variable `TELEGRAM_BOT_TOKEN`; added to `.env` and loaded via existing env config pattern.
|
||||
- **Authorized users only:** Store allowed Telegram user IDs or usernames in nexus-settings to prevent unauthorized access; a bridge with no auth is a security hole.
|
||||
- **Conversation context:** Each Telegram chat ID maps to a Nexus workspace session; maintain a `telegramChatId → workspaceId + conversationId` mapping in a lightweight in-memory store or SQLite table.
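
The "authorized users only" decision above can be sketched as a small allow-list middleware. This is a minimal sketch, not the project's implementation: `MinimalContext`, `isAuthorizedUser`, and `allowOnly` are hypothetical names, and the allow-list is assumed to have already been loaded from nexus-settings as a set of numeric Telegram user IDs.

```typescript
// Minimal structural view of an update context (hypothetical type).
// Telegraf's real context also exposes `from?.id`, so a middleware written
// against this shape should fit `bot.use(...)`.
interface MinimalContext {
  from?: { id: number };
}

// Pure check, kept separate so it can be unit-tested without a bot instance.
function isAuthorizedUser(
  userId: number | undefined,
  allowed: ReadonlySet<number>,
): boolean {
  return userId !== undefined && allowed.has(userId);
}

// Build a middleware that only calls `next` for allow-listed user IDs.
function allowOnly(allowed: ReadonlySet<number>) {
  return async (ctx: MinimalContext, next: () => Promise<void>): Promise<void> => {
    if (!isAuthorizedUser(ctx.from?.id, allowed)) {
      return; // Drop the update; unauthorized senders get no reply.
    }
    await next();
  };
}
```

With Telegraf, this would presumably be registered via `bot.use(allowOnly(ids))` before any message or voice handlers, followed by `bot.launch()` for polling mode.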
|
||||
|
||||
---
|
||||
|
||||
## Hardware Detection Implementation Notes
|
||||
## Voice Mode Response Formatting Notes
|
||||
|
||||
**Confidence:** HIGH — patterns well-established across Ollama, LM Studio, llm-checker.
|
||||
**Confidence:** MEDIUM — dual output prompt pattern is used in production systems but prompt reliability varies by model; post-processing strip is more reliable.
|
||||
|
||||
Detection sources (Node.js server-side, run once at onboarding):
|
||||
1. `os.totalmem()` — system RAM (always available)
|
||||
2. Spawn `nvidia-smi --query-gpu=memory.total --format=csv,noheader` — NVIDIA VRAM
|
||||
3. `system_profiler SPDisplaysDataType` (macOS) — Apple Silicon unified memory
|
||||
4. Ollama `/api/tags` endpoint — detect already-running models
|
||||
5. `/proc/driver/nvidia/gpus/` (Linux) — alternative NVIDIA detection
|
||||
### Two approaches, use both as fallback:
|
||||
|
||||
Model recommendation lookup table (simplified):
|
||||
**Approach A: Prompt-based dual output (preferred)**
|
||||
Append to system prompt when `voice_mode: true`:
|
||||
```
|
||||
CPU-only / <8GB RAM: phi3:mini (3.8B), llama3.2:1b
|
||||
8-16GB RAM: llama3.2:3b, mistral:7b, phi3:medium
|
||||
16-24GB unified: llama3.1:8b, qwen2.5:7b
|
||||
24GB+ unified / GPU: llama3.1:70b (quantized), qwen2.5:32b
|
||||
When responding, provide two versions:
|
||||
SPOKEN: [1-3 sentences in natural spoken prose, no markdown, no symbols, no lists]
|
||||
DETAILED: [Full response with markdown formatting, code blocks, bullet points as needed]
|
||||
```
|
||||
Parse response: split on `SPOKEN:` and `DETAILED:` markers.
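
The marker split for Approach A could look roughly like the sketch below. `DualResponse` and `parseDualResponse` are hypothetical names, and the parser assumes the model emits `SPOKEN:` before `DETAILED:` as the prompt addendum requests. It returns `null` when the format is not followed, so the caller can fall back to Approach B.

```typescript
// Hypothetical shape of a parsed dual-output reply.
interface DualResponse {
  spoken: string;
  detailed: string;
}

// Split a model reply on the SPOKEN: and DETAILED: markers.
// Returns null if either marker is missing or either part is empty.
function parseDualResponse(raw: string): DualResponse | null {
  const match = /SPOKEN:\s*([\s\S]*?)\s*DETAILED:\s*([\s\S]*)/.exec(raw);
  if (match === null) return null;
  const spoken = match[1].trim();
  const detailed = match[2].trim();
  if (spoken.length === 0 || detailed.length === 0) return null;
  return { spoken, detailed };
}
```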
|
||||
|
||||
---
|
||||
**Approach B: Post-processing strip (fallback)**
|
||||
If the model doesn't follow the dual output format, post-process the full response:
|
||||
- Strip `**bold**` → "bold"
|
||||
- Strip `` `code` `` → "code"
|
||||
- Strip `# headers` → remove `#` prefix
|
||||
- Strip `- ` bullet points → convert to sentences or strip
|
||||
- Strip ``` code blocks ``` → summarize as "[code example]" or remove entirely
|
||||
Use as the spoken variant. The full original markdown response is the detailed variant.
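
A minimal sketch of the Approach B strip rules, assuming plain regex replacement is good enough for a fallback path. `stripMarkdownForSpeech` is a hypothetical name; the rules mirror the list above (bullets are stripped rather than converted to sentences, and code blocks become "[code example]").

```typescript
function stripMarkdownForSpeech(markdown: string): string {
  return markdown
    // Fenced code blocks first, so later rules do not touch their contents.
    .replace(/`{3}[\s\S]*?`{3}/g, "[code example]")
    // Inline code: keep the text, drop the backticks.
    .replace(/`([^`]*)`/g, "$1")
    // Bold: keep the text, drop the asterisks.
    .replace(/\*\*([^*]+)\*\*/g, "$1")
    // Headers: remove the leading # characters.
    .replace(/^#{1,6}[ \t]+/gm, "")
    // Bullet points: remove the leading "- " marker.
    .replace(/^[ \t]*-[ \t]+/gm, "")
    // Collapse the blank lines left behind.
    .replace(/\n{2,}/g, "\n")
    .trim();
}
```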
|
||||
|
||||
## Persistent Memory Implementation Notes
|
||||
|
||||
**Confidence:** MEDIUM — standard pattern, but the specific storage mechanism in Paperclip's DB needs verification.
|
||||
|
||||
Standard patterns in production personal AI assistants:
|
||||
1. **Summary-based memory:** After each conversation, run an LLM call to extract key facts → store as `memory` rows. On next conversation, inject relevant memories into system prompt.
|
||||
2. **Verbatim storage:** Store full conversation history, retrieve last N messages or vector-search for relevant passages.
|
||||
3. **Hybrid:** Store both summaries (for long-term preferences) and recent verbatim context (for continuity).
|
||||
|
||||
Recommended for Nexus: Summary-based for long-term memory (preferences, ongoing projects, user facts) + last 10 messages as verbatim context. Avoids needing a vector database. Uses existing SQLite schema with a new `assistant_memories` table.
|
||||
|
||||
**MCP-compatible storage:** The MCP memory pattern (used by Penfield, mcp-memory-service) stores memories as MCP tool call results — same summary pattern, just with MCP as the transport. Nexus does not need to implement MCP just for memory; MCP is for external tool connections.
|
||||
|
||||
---
|
||||
|
||||
## Voice Architecture Notes
|
||||
|
||||
**Confidence:** MEDIUM — Piper confirmed CPU-capable and fast on Apple Silicon; full pipeline integration complexity is estimated, not measured.
|
||||
|
||||
Pipeline for full voice I/O:
|
||||
```
|
||||
Microphone → MediaRecorder (browser) → Whisper (existing, v1.3) → LLM (any provider)
|
||||
↓
|
||||
Speaker ← Web Audio API ← Piper TTS (new) ← Text response ←────────────┘
|
||||
```
|
||||
|
||||
Piper TTS:
|
||||
- Open-source (rhasspy/piper), MIT license
|
||||
- Runs on CPU; Apple Silicon M4 handles it in real-time
|
||||
- Node.js integration: spawn `piper` binary with text via stdin, read WAV from stdout
|
||||
- Voice models: compact (few MB) per language/voice; ship one English voice as default
|
||||
- Streaming: buffer by sentence for lower perceived latency (start playing sentence 1 while sentence 2 synthesizes)
|
||||
|
||||
Whisper is already integrated (v1.3). Piper adds the TTS half to complete the loop.
|
||||
**Reliable rule:** Never read markdown symbols aloud. Either approach prevents this; dual output is preferred because it lets the LLM choose better phrasing for spoken delivery (short, natural sentences vs. information-dense bullets).
|
||||
|
||||
---
|
||||
|
||||
## Sources
|
||||
|
||||
- [Puter.js Free AI API (developer.puter.com)](https://developer.puter.com/tutorials/free-unlimited-ai-api/)
|
||||
- [Puter.js Free LLM API (developer.puter.com)](https://developer.puter.com/tutorials/free-llm-api/)
|
||||
- [Puter User-Pays Model (docs.puter.com)](https://docs.puter.com/user-pays-model/)
|
||||
- [Ollama Hardware Detection and GPU Support (deepwiki.com)](https://deepwiki.com/ollama/ollama/6-gpu-and-hardware-support)
|
||||
- [Ollama VRAM Requirements 2026 (localllm.in)](https://localllm.in/blog/ollama-vram-requirements-for-local-llms)
|
||||
- [AI Hardware Guide 2026 (localaimaster.com)](https://localaimaster.com/blog/ai-hardware-requirements-2025-complete-guide)
|
||||
- [Model Context Protocol Wikipedia](https://en.wikipedia.org/wiki/Model_Context_Protocol)
|
||||
- [MCP for Persistent Memory (medium.com)](https://medium.com/mynextdeveloper/how-to-set-up-model-context-protocol-mcp-for-persistent-memory-in-your-ai-app-9c2f819f5c21)
|
||||
- [Piper TTS GitHub (rhasspy/piper)](https://github.com/rhasspy/piper)
|
||||
- [Real-Time vs Turn-Based STT/TTS Voice Agent Architecture (softcery.com)](https://softcery.com/lab/ai-voice-agents-real-time-vs-turn-based-tts-stt-architecture)
|
||||
- [The Voice AI Stack for Building Agents (assemblyai.com)](https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents)
|
||||
- [One-Second Voice-to-Voice Latency with Modal, Pipecat, and Open Models (modal.com)](https://modal.com/blog/low-latency-voice-bot)
|
||||
- [Voice Chat with Local LLMs: Whisper + TTS (insiderllm.com)](https://www.insiderllm.com/guides/voice-chat-local-llms-whisper-tts/)
|
||||
- [Google Gemini API Free Tier 2026 (aifreeapi.com)](https://www.aifreeapi.com/en/posts/google-gemini-api-free-tier)
|
||||
- [Google Gemini OAuth via Opencode (syntackle.com)](https://syntackle.com/blog/google-gemini-ai-subscription-with-opencode/)
|
||||
- [AI Handoff Patterns in Multi-Agent Systems (towardsdatascience.com)](https://towardsdatascience.com/how-agent-handoffs-work-in-multi-agent-systems/)
|
||||
- [Building an NPX CLI Tool (johnsedlak.com)](https://johnsedlak.com/blog/2025/03/building-an-npx-cli-tool)
|
||||
- [Postman Onboarding UX Lessons (candu.ai)](https://www.candu.ai/blog/postman-onboarding-ux-lessons)
|
||||
- [whisper-cpp VAD (ggml-org/whisper.cpp on GitHub)](https://github.com/ggml-org/whisper.cpp)
|
||||
- [Telegram Bot API — sendVoice (core.telegram.org)](https://core.telegram.org/bots/api)
|
||||
- [Convert Voice Memos from Telegram to Text using OpenAI Whisper (dev.to)](https://dev.to/techresolve/solved-convert-voice-memos-from-telegram-to-text-using-openai-whisper-api-41al)
|
||||
- [Telegram speech-to-text bot with Node.js (loonskai.com)](https://www.loonskai.com/blog/telegram-speech-to-text-bot-with-nodejs)
|
||||
- [Telegraf: Modern Telegram Bot Framework for Node.js (telegraf.js.org)](https://telegraf.js.org/)
|
||||
- [HA Voice PE markdown post-processing discussion (community.home-assistant.io)](https://community.home-assistant.io/t/ha-voice-pe-add-post-processing-step-between-conversation-agent-and-speech-to-text-step/893933)
|
||||
- [Two design patterns for Telegram Bots (dev.to/madhead)](https://dev.to/madhead/two-design-patterns-for-telegram-bots-59f5)
|
||||
- [Design voice AI experiences — LiveKit Agents UI (livekit.com)](https://livekit.com/blog/design-voice-ai-interfaces-with-agents-ui)
|
||||
- [User Interaction Patterns in LLM-Powered Voice Assistants (arxiv.org)](https://arxiv.org/html/2309.13879v2)
|
||||
- [voicegram PyPI — OGG/Opus conversion (pypi.org)](https://pypi.org/project/voicegram/)
|
||||
|
||||
---
|
||||
*Feature research for: Nexus v1.5 Smart Onboarding + Personal AI Assistant*
|
||||
*Researched: 2026-04-02*
|
||||
*Feature research for: Nexus v1.6 Voice Pipeline + Minimal Telegram Bridge*
|
||||
*Researched: 2026-04-03*
|
||||
|
|
|
|||
|
|
@ -2,13 +2,14 @@
|
|||
|
||||
**Domain:** Forked open-source project with display-layer renames, no i18n layer
|
||||
**Researched:** 2026-04-02 (updated for v1.5 milestone: smart onboarding, multi-provider, voice TTS, persistent memory, assistant mode, `npx buildthis`)
|
||||
**Updated:** 2026-04-03 (v1.6 milestone: server-side Whisper STT, server-side Piper TTS, Telegram bridge)
|
||||
**Confidence:** HIGH — based on direct codebase analysis of `/opt/nexus/` plus targeted research on each new integration domain
|
||||
|
||||
---
|
||||
|
||||
## About This Document
|
||||
|
||||
This file covers pitfalls for the **v1.5 milestone additions**. The original pitfalls (Pitfalls 1–11) covering fork hygiene, display-layer rename discipline, and upstream sync remain valid and are preserved below. Pitfalls 12–26 are new for v1.5.
|
||||
This file covers pitfalls for the **v1.5 and v1.6 milestone additions**. The original pitfalls (Pitfalls 1–11) covering fork hygiene, display-layer rename discipline, and upstream sync remain valid and are preserved below. Pitfalls 12–26 are new for v1.5. Pitfalls 27–44 are new for v1.6 (voice pipeline + Telegram bridge).
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -493,6 +494,430 @@ The existing code in `getRecommendedModel()` silently skips models not in the ca
|
|||
|
||||
---
|
||||
|
||||
## Critical Pitfalls (v1.6 — Voice Pipeline + Telegram Bridge)
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 27: Audio Format Mismatch Between Browser Recording and Whisper Input
|
||||
|
||||
**What goes wrong:** The browser's `MediaRecorder` produces audio in formats that vary by browser. Chrome records `audio/webm;codecs=opus`. Firefox records `audio/ogg;codecs=opus`. Safari (since 18.4) can record `audio/webm;codecs=opus` but used to produce `audio/mp4`. Whisper (and faster-whisper) requires 16 kHz mono PCM WAV — none of these formats match directly.
|
||||
|
||||
The trap is assuming a single format pipeline will work everywhere. Sending a WebM blob directly to Whisper either causes a silent transcription failure (empty string returned) or an error that is swallowed by the error handler, making the feature appear to work while returning nothing.
|
||||
|
||||
**Why it happens:** Browser format diversity is historically inconsistent. `MediaRecorder.isTypeSupported('audio/webm;codecs=opus')` returns true in Chrome, Firefox, and Safari 18.4+ — but the produced bitrates and frame durations differ in ways that affect downstream processing. Developers test on Chrome and never encounter Firefox/Safari failures.
|
||||
|
||||
**How to avoid:**
|
||||
1. Always transcode to 16 kHz mono WAV on the server before passing audio to Whisper. Use ffmpeg: `ffmpeg -i input -ar 16000 -ac 1 -f wav output.wav`. This handles any valid audio format the browser might send.
|
||||
2. On the client, use `MediaRecorder.isTypeSupported()` to detect the actual format being used and send the MIME type in the upload request header so the server knows what it is receiving.
|
||||
3. Do not infer the codec from the Content-Type header or the file extension. A WebM container can hold different codecs, so always transcode rather than assume.
|
||||
4. ffmpeg must be installed and in PATH on the server. Make this a hard dependency checked at server startup (`which ffmpeg || exit 1`), not a silent fallback.
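
The transcode step above can be sketched with Node's `child_process`. This is a sketch under stated assumptions: `buildTranscodeArgs` and `transcodeToWav` are hypothetical helper names, and the `-nostdin` and `-y` flags are additions to the command quoted above (they stop ffmpeg from waiting on stdin and allow overwriting the output file).

```typescript
import { spawn } from "node:child_process";

// Same conversion as the command above: resample to 16 kHz, downmix to mono, write WAV.
// -nostdin and -y are assumptions added for non-interactive server use.
function buildTranscodeArgs(inputPath: string, outputPath: string): string[] {
  return ["-nostdin", "-y", "-i", inputPath, "-ar", "16000", "-ac", "1", "-f", "wav", outputPath];
}

// Run ffmpeg and reject on spawn failure or a non-zero exit code,
// so a failed decode never looks like a successful empty transcription.
function transcodeToWav(inputPath: string, outputPath: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const child = spawn("ffmpeg", buildTranscodeArgs(inputPath, outputPath));
    let stderr = "";
    child.stderr.on("data", (chunk: Buffer) => {
      stderr += chunk.toString();
    });
    child.on("error", reject); // e.g. ffmpeg is not in PATH
    child.on("close", (code) => {
      if (code === 0) resolve();
      else reject(new Error(`ffmpeg exited with code ${code}: ${stderr.slice(-500)}`));
    });
  });
}
```

Rejecting on any non-zero exit keeps the failure visible to the route handler instead of letting it be swallowed.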
|
||||
|
||||
**Warning signs:**
|
||||
- Transcription returns empty string on Safari or Firefox but works on Chrome
|
||||
- Whisper logs show "unsupported format" or "decode error"
|
||||
- `ffmpeg` not in PATH on production server (check server startup log)
|
||||
- Audio upload succeeds (HTTP 200) but transcribed text is empty
|
||||
|
||||
**Phase to address:** Phase 1 (Whisper STT pipeline) — transcode pipeline must be in place before any browser testing.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 28: Telegram Voice Messages Arriving as OGG/Opus at 48 kHz, Not 16 kHz
|
||||
|
||||
**What goes wrong:** Telegram voice messages use the OGG container with Opus codec at 48 kHz mono, stored as `audio_[id].ogg`. This is a different container format from what the browser sends (WebM), and a different sample rate from what Whisper expects (16 kHz). Treating the two pipelines identically breaks silently: if the conversion step changes the container but omits the explicit resample to 16 kHz, Whisper either produces garbage transcriptions or fails.
|
||||
|
||||
A documented second trap: some Telegram voice messages arrive with the MIME type flagged as `audio/ogg` but the file extension `.oga`. Not all MIME type parsers recognize `.oga`, so the media pipeline may classify the file as "unrecognized audio" and skip transcription entirely.
|
||||
|
||||
**Why it happens:** Telegram's wire format is documented but developers building voice-to-text pipelines often copy the browser audio pipeline without adjusting for Telegram's specific encoding. The OGG container with Opus codec at 48 kHz is valid audio that plays fine in media players, so local testing succeeds but transcription quality degrades.
|
||||
|
||||
**How to avoid:**
|
||||
1. Use a dedicated Telegram audio conversion step: `ffmpeg -i input.ogg -ar 16000 -ac 1 -f wav output.wav`. This is identical to the browser pipeline but sourced from a downloaded Telegram file, not a browser blob.
|
||||
2. Download the Telegram voice file using `getFile` + the CDN URL before transcribing. Do not attempt to stream or pipe Telegram file downloads directly to Whisper.
|
||||
3. Treat `.oga` and `.ogg` as the same format — normalize file handling to check codec metadata rather than relying on extension.
|
||||
4. Log the input file size and audio duration before transcribing, so that a 0-byte or corrupted Telegram download surfaces as an explicit error instead of silently producing empty text.
|
||||
|
||||
**Warning signs:**
|
||||
- Telegram voice messages return empty transcription while browser voice works correctly
|
||||
- ffmpeg logs showing "48000 Hz" input — correct but needs explicit `-ar 16000` flag
|
||||
- Files downloaded from Telegram with `.oga` extension not recognized by MIME type check
|
||||
|
||||
**Phase to address:** Phase 3 (Telegram bridge audio handling) — the OGG/Opus download-and-transcode path must be tested with real Telegram voice messages before the bridge ships.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 29: Spawning a New Piper Process Per TTS Request (Process-Per-Request Anti-Pattern)
|
||||
|
||||
**What goes wrong:** The Piper binary is a CLI tool: `piper --model voice.onnx --output_file out.wav < text.txt`. The naive Node.js integration spawns a new process for each TTS request. Two problems emerge:
|
||||
|
||||
1. **Model reload latency.** Piper loads the ONNX voice model into memory on startup. With CPU-only inference (the case on an M4 Mac Mini, since Apple Silicon has no CUDA), this takes 200–800ms per request. For a voice reply to a short message, this means 1–2 seconds of silence before audio starts.
|
||||
|
||||
2. **Long text truncation.** A documented Piper bug: when processing text longer than ~500 characters via stdin pipe, Piper silently truncates the output or exits early. The generated audio file exists but is shorter than expected. The calling code sees a successful exit code and plays the truncated audio without knowing content was lost.
|
||||
|
||||
**Why it happens:** CLI tools feel simple to integrate. The first working implementation spawns a process, gets output, done. The model-reload cost and the long-text bug only surface in production use with real message lengths.
|
||||
|
||||
**How to avoid:**
|
||||
1. Run Piper as a persistent HTTP service on a local port (there is a community `piper-http` wrapper, or implement one). The process stays alive between requests, keeping the model in memory.
|
||||
2. For long responses (>400 characters), split text into sentence-level chunks before sending to Piper. Synthesize each chunk and concatenate the WAV files. This avoids both truncation and per-request reload cost.
|
||||
3. Implement a warmup call at server startup: send a short dummy text to Piper to force model loading before the first real request.
|
||||
4. Cap TTS at a reasonable character limit for voice output (e.g., 1500 chars) — this is a UX constraint anyway; wall-of-text responses should not be read aloud verbatim.
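
The sentence-level chunking in step 2 could be sketched as follows. `chunkBySentence` is a hypothetical helper, and the 400-character default simply mirrors the threshold mentioned above. The sentence split is a naive punctuation regex, so abbreviations and decimals such as "3.5" will be split too; that is harmless for chunking but worth knowing.

```typescript
// Split text into chunks of at most maxChars, breaking only on sentence boundaries.
// A single sentence longer than maxChars is kept whole rather than cut mid-word.
function chunkBySentence(text: string, maxChars = 400): string[] {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    // Close the current chunk if adding this sentence would exceed the limit.
    if (current.length > 0 && (current + sentence).trim().length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}
```

Each chunk would then be synthesized separately and the resulting WAV segments concatenated in order.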
|
||||
|
||||
**Warning signs:**
|
||||
- First TTS response takes 2+ seconds after server restart
|
||||
- Audio playback cuts off mid-sentence on responses longer than ~30 words
|
||||
- Process table shows a new `piper` process appearing and dying for each TTS request
|
||||
- TTS works in unit tests (short strings) but fails in integration tests (real agent responses)
|
||||
|
||||
**Phase to address:** Phase 2 (Piper TTS pipeline) — persistent process architecture must be designed before the first response endpoint is implemented.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 30: Whisper Model Loading on Every Request (Memory Spike Anti-Pattern)
|
||||
|
||||
**What goes wrong:** Whisper and faster-whisper load a large model into memory (tiny: ~150MB, small: ~500MB, medium: ~1.5GB). If the STT endpoint loads the model fresh for each HTTP request — or if the Python process exits and restarts — every concurrent transcription request duplicates the model in memory. On an M4 Mac Mini with 16GB unified memory running Paperclip + Ollama, this can cause the system to swap and degrade all services.
|
||||
|
||||
A secondary issue: faster-whisper has a documented memory leak where RAM from a transcription session is not fully released. On a long-running server, this causes gradual memory growth over hours.
|
||||
|
||||
**Why it happens:** Python subprocesses spawned from Node.js are short-lived by default. The "simplest integration" is `spawn('python3', ['transcribe.py', audioPath])` — this reloads the model every time. Developers test with a handful of requests and don't observe the memory pattern.
|
||||
|
||||
**How to avoid:**
|
||||
1. Run Whisper/faster-whisper as a persistent sidecar process (e.g., a FastAPI service on `localhost:8001`). Node.js calls `POST /transcribe` via HTTP. The model stays loaded in the Python process between requests.
|
||||
2. On the Mac Mini M4, use `whisper-mlx` or `mlx-whisper`, which use Apple's MLX framework for 2–3x faster transcription on Apple Silicon with lower memory overhead compared to PyTorch.
|
||||
3. Implement a request queue in the sidecar: accept one transcription at a time, queue the rest. This prevents concurrent requests from doubling memory usage.
|
||||
4. Add a health check endpoint to the sidecar: `/health` returns model load status. The main server waits for this to be healthy before routing traffic.
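
Step 3 places the queue inside the sidecar. The same one-at-a-time idea can also be applied on the Node.js side, in front of the HTTP call, which is what this sketch shows instead. `SerialQueue` is a hypothetical class, and the task passed to `enqueue` is assumed to be whatever function performs the `POST /transcribe` request.

```typescript
// Run async tasks strictly one at a time, in submission order.
class SerialQueue {
  // Tail of the chain; it never rejects, so one failure cannot block later tasks.
  private tail: Promise<unknown> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Start this task only after everything queued before it has settled.
    const result = this.tail.then(task);
    // Swallow the error on the chain only; the caller still sees it via `result`.
    this.tail = result.catch(() => undefined);
    return result;
  }
}
```

A caller would wrap each transcription request in `queue.enqueue(() => ...)`, so two uploads arriving together are sent to the sidecar back to back rather than concurrently.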
|
||||
|
||||
**Warning signs:**
|
||||
- Memory usage spikes by 500MB+ on each transcription request
|
||||
- `python3` processes appearing in `ps aux` that don't match the count of active requests
|
||||
- Transcription latency increasing linearly with server uptime (memory leak indicator)
|
||||
- System starts swapping after 20–30 transcription requests
|
||||
|
||||
**Phase to address:** Phase 1 (Whisper STT pipeline) — sidecar architecture must be the starting design, not a later refactor.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 31: Browser Silence Detection Triggering Too Early or Too Late
|
||||
|
||||
**What goes wrong:** The web chat mic button uses client-side silence detection to auto-stop recording and send the audio. Threshold-based silence detection (RMS below X for N milliseconds) has two failure modes:
|
||||
|
||||
1. **Too eager:** Fires after a natural pause mid-sentence ("I want to... create a new task"). The user is still speaking, but the detector interprets the inter-clause pause as end-of-speech. Audio is sent and transcribed as incomplete input.
|
||||
|
||||
2. **Too late:** With a sensitive microphone, background noise such as breathing or HVAC hum keeps the RMS above the silence threshold even after the user has stopped speaking. Recording never auto-stops. The user waits, unsure if anything is working.
|
||||
|
||||
Simple RMS-based detection (the first approach most developers reach for) achieves only ~50% true positive rate for end-of-speech detection at a 5% false positive rate. Production-quality VAD (Silero, Picovoice Cobra) achieves 87–99%.
|
||||
|
||||
**Why it happens:** RMS threshold detection is two lines of code. It works in demo conditions (quiet room, clear speech, no pauses). It fails noticeably in real use. Developers ship the demo implementation.
|
||||
|
||||
**How to avoid:**
|
||||
1. Use `@ricky0123/vad-web` (browser-native VAD using Silero model via ONNX Runtime Web). It runs off the main thread, handles natural pauses, and achieves significantly better accuracy than threshold detection.
|
||||
2. Set a maximum recording duration (e.g., 60 seconds) as a fallback — always auto-stop even if silence detection is confused.
|
||||
3. Show a waveform visualization while recording so users can see whether the mic is capturing audio (helps them self-diagnose "is it recording?").
|
||||
4. Provide a manual stop button alongside auto-stop — never rely solely on automatic detection.
|
||||
|
||||
**Warning signs:**
|
||||
- Transcription of "I want to" submitted as complete message
|
||||
- Recording indicator stays active for 30+ seconds in normal use
|
||||
- Users repeatedly clicking the mic button because auto-stop didn't fire
|
||||
- Silence threshold value hardcoded to a constant (needs to be calibrated per device)
|
||||
|
||||
**Phase to address:** Phase 2 (Web chat mic button) — VAD library choice must be made before the recording UI is built.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 32: Voice Mode Flag Not Propagated Through the Agent Session Layer
|
||||
|
||||
**What goes wrong:** The voice mode flag (`isVoiceMode: true`) attached to a user message in the web chat must reach the agent's system prompt generator to trigger voice-optimized response formatting (shorter sentences, no markdown, no code blocks). If the flag is stripped or not forwarded at any point in the message pipeline — SSE event, message persistence, agent session codec, Hermes adapter — the agent responds in its default text format. The TTS then tries to synthesize text containing backtick code blocks, markdown headers, and bullet points, producing robotic-sounding output like "backtick backtick backtick bash backtick backtick backtick."
|
||||
|
||||
**Why it happens:** Message metadata (anything beyond `content` and `role`) is treated as optional. Each layer in the pipeline — the Express route handler, the SSE broadcaster, the DB persistence layer, the adapter's session encoder — may serialize/deserialize the message and drop non-standard fields. The flag is present in the browser but never reaches the agent runtime.
|
||||
|
||||
**How to avoid:**
|
||||
1. Audit every layer the message passes through: client → `POST /api/chat/messages` → message persistence → agent session codec → Hermes adapter system prompt. Verify `isVoiceMode` (or equivalent) is preserved at each layer.
|
||||
2. Use Paperclip's existing message metadata mechanism (if present) rather than adding a top-level field that might be stripped. Check whether the message schema has a `metadata` JSON column.
|
||||
3. Test the full pipeline end-to-end: send a voice-flagged message and check the agent's system prompt (log it in development mode) to confirm the voice formatting instruction is present.
|
||||
4. The dual output pattern (voice-optimized response + full text with code blocks) requires the LLM to produce two outputs or a structured output with separate fields. Design this contract before implementing either end.
|
||||
|
||||
**Warning signs:**
|
||||
- Agent responses in voice mode contain markdown formatting
|
||||
- TTS output includes spoken "asterisk", "backtick", or "hash" characters
|
||||
- `isVoiceMode` present in browser network request payload but absent in server-side agent session log
|
||||
- Voice and text responses are identical in content and format
|
||||
|
||||
**Phase to address:** Phase 1 (Voice mode flag) — the flag propagation path must be fully designed and tested with a no-op handler before the dual output pattern is built on top of it.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 33: Telegram Bridge Creating a Competing Session Identity for the Same Agent
|
||||
|
||||
**What goes wrong:** Paperclip's agent session model assigns one session per agent per "channel" (web, API, etc.). The Telegram bridge opens a new channel. If the bridge creates a new session for each Telegram message instead of maintaining a persistent session for the Telegram channel, the agent loses context between Telegram messages — each message starts a fresh conversation.
|
||||
|
||||
A documented related bug in similar systems: when the Telegram bridge relays a message through the web channel gateway instead of its own channel, the session's `channel` field gets overwritten from `telegram` to `webchat`. Subsequent agent replies are then routed to the web UI, not back to Telegram. The user sends a Telegram message and the reply appears in the browser but never arrives in Telegram.
|
||||
|
||||
**Why it happens:** Reusing the existing web session mechanism (the path of least resistance) overwrites session channel metadata. The Telegram bridge needs its own channel identity that the session persists.
|
||||
|
||||
**How to avoid:**
|
||||
1. Create a dedicated `telegram` channel type in the session layer. Do not route Telegram messages through the `webchat` gateway — use a separate message path that preserves `channel: "telegram"`.
|
||||
2. Maintain a persistent session per Telegram chat ID (not per message). Store the `sessionId ↔ chatId` mapping in a lightweight lookup (in-memory Map for single-user deployment, or a simple JSON file).
|
||||
3. On agent reply, inspect the originating session's channel field. Route replies to Telegram if `channel === "telegram"`, to the web UI if `channel === "webchat"`. Never allow this routing to be overwritten by message relay logic.
|
||||
4. Test the routing explicitly: send a Telegram message, verify the reply arrives in Telegram (not in web UI), send a web chat message to the same agent, verify the reply arrives in the browser (not in Telegram).
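
Steps 2 and 3 can be sketched together as an in-memory store plus a routing function. All names here (`Channel`, `Session`, `TelegramSessionStore`, `replyTarget`) are hypothetical and do not reflect Paperclip's actual session schema; the sketch only illustrates keeping one session per chat ID and routing strictly on the stored channel.

```typescript
type Channel = "telegram" | "webchat";

// Hypothetical session shape; the real session model may differ.
interface Session {
  sessionId: string;
  channel: Channel;
}

// One persistent session per Telegram chat ID (in-memory, single-user deployment).
class TelegramSessionStore {
  private readonly byChatId = new Map<number, Session>();
  private nextId = 1;

  // Return the existing session for this chat, creating one only on first contact.
  getOrCreate(chatId: number): Session {
    let session = this.byChatId.get(chatId);
    if (session === undefined) {
      session = { sessionId: `tg-${this.nextId++}`, channel: "telegram" };
      this.byChatId.set(chatId, session);
    }
    return session;
  }
}

// Route a reply by the originating session's channel, never by the relay path.
function replyTarget(session: Session): "telegram" | "web-ui" {
  return session.channel === "telegram" ? "telegram" : "web-ui";
}
```

An in-memory map loses its contents on restart; persisting the mapping (the JSON file option mentioned above) would preserve conversation continuity across restarts.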
|
||||
|
||||
**Warning signs:**
|
||||
- Telegram messages receive no reply in Telegram but a reply appears in the web chat interface
|
||||
- Session `channel` field changes from `telegram` to `webchat` after the first reply
|
||||
- New session created for every Telegram message (no conversation continuity)
|
||||
- Agent session table grows unboundedly (new session per Telegram message)
|
||||
|
||||
**Phase to address:** Phase 3 (Telegram bridge session layer) — session identity design must be finalized before the bridge handler is implemented.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 34: Telegram Webhook vs. Long Polling — Wrong Choice for This Deployment
|
||||
|
||||
**What goes wrong:** Telegram bots receive updates either via long polling (`getUpdates`) or webhooks. The choice matters for this deployment:
|
||||
|
||||
- **Webhook** requires: a publicly accessible HTTPS URL, a valid TLS certificate, and a port in [80, 88, 443, 8443]. The Nexus deployment is a local Mac Mini without a public URL. Webhooks do not work behind NAT/LAN without a tunnel (ngrok, Cloudflare Tunnel, etc.).
|
||||
- **Long polling** works from behind NAT. The bot calls `getUpdates` in a loop, and each request stays open until updates arrive or the timeout elapses. No public URL required.
|
||||
|
||||
The trap: developers set up webhooks because webhook tutorials are more common, then wonder why Telegram isn't delivering updates. Or they use long polling but run multiple processes that all call `getUpdates` simultaneously — Telegram delivers each update to only one caller, so updates are split between processes and lost.
|
||||
|
||||
**Why it happens:** Webhook is the "production-grade" recommendation in most Telegram bot guides. Local deployment contexts are underrepresented in tutorials.
|
||||
|
||||
**How to avoid:**
|
||||
1. For this deployment (local Mac Mini, no public URL, single user): use long polling. It is simpler, works behind NAT, and the latency difference (1–2 seconds vs. real-time) is irrelevant for a personal assistant.
|
||||
2. Ensure only ONE process calls `getUpdates`. If the Express server restarts, verify the previous polling loop has stopped before starting a new one.
|
||||
3. Use a Telegram bot library (Telegraf, grammY) rather than raw HTTP polling — these libraries handle the polling loop, update acknowledgement, and error recovery correctly.
|
||||
4. Never mix polling and webhooks: if a webhook was previously registered, it must be explicitly deleted (`deleteWebhook`) before long polling will work.
|
||||
|
||||
**Warning signs:**
|
||||
- Telegram updates not arriving despite correct bot token
|
||||
- Some messages received, others not (multiple processes polling simultaneously)
|
||||
- `setWebhook` called during testing but server not publicly accessible
|
||||
- `getWebhookInfo` returns a webhook URL pointing to `localhost`
|
||||
|
||||
**Phase to address:** Phase 1 (Telegram bridge setup) — polling vs. webhook decision must be made before any bot code is written.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 35: TTS Synthesizing Agent Prefixes, Timestamps, and Metadata Verbatim
|
||||
|
||||
**What goes wrong:** The Telegram bridge prefixes agent replies with the agent name: `[Nexus] Here is your answer...`. The web chat renders this as a styled badge. When the same message content is passed to TTS, Piper synthesizes "bracket Nexus bracket Here is your answer" — the prefix is read aloud verbatim.
|
||||
|
||||
Similarly, if any message metadata (issue IDs like `#ISS-42`, timestamps, Markdown formatting characters) reaches the TTS synthesis input without being stripped, the audio sounds broken and robotic.
|
||||
|
||||
**Why it happens:** The message content as stored is the same string used for both display (where the prefix is rendered as a badge) and audio synthesis. The stripping step is obvious in retrospect but is easily forgotten when the display rendering works correctly.
|
||||
|
||||
**How to avoid:**
|
||||
1. Create a `sanitizeForTTS(text: string): string` utility function applied before any text reaches the Piper synthesis call. It strips: agent prefixes (`[AgentName]` patterns), Markdown formatting (`**`, `*`, `#`, backticks, `>` blockquote markers), issue/task IDs (`#ISS-\d+`, `#TSK-\d+`), URLs (replace with "link"), and code blocks (replace with "code example").
|
||||
2. Apply this sanitization at the TTS layer, not at the storage layer — the stored message should remain unmodified so the web UI can render it correctly.
|
||||
3. For the dual output pattern (voice-optimized + full text), the voice-optimized variant should already be prose-formatted — `sanitizeForTTS` is a safety net, not the primary formatting mechanism.
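
A minimal sketch of the `sanitizeForTTS` utility described in step 1 follows. The regular expressions are assumptions about what the stored messages look like (a single leading `[AgentName]` prefix, `#ISS-`/`#TSK-` IDs, fenced code blocks); they would need tuning against real agent output, and a message that begins with a Markdown link would lose its link text to the prefix rule.

```typescript
// Safety-net sanitizer applied just before text is handed to Piper.
// Patterns are illustrative assumptions, not a complete Markdown parser.
export function sanitizeForTTS(text: string): string {
  return text
    .replace(/```[\s\S]*?```/g, " code example ") // fenced code blocks
    .replace(/^\s*\[[^\]\n]+\]\s*/, "")           // leading [AgentName] prefix
    .replace(/https?:\/\/\S+/g, "link")           // URLs
    .replace(/#(?:ISS|TSK)-\d+/g, "")             // issue and task IDs
    .replace(/^\s*(?:#{1,6}|>)\s*/gm, "")         // heading and blockquote markers
    .replace(/[*`]/g, "")                         // emphasis and inline-code markers
    .replace(/\s+/g, " ")                         // collapse leftover whitespace
    .trim();
}
```

The function only transforms the string passed to synthesis; the stored message is never touched, which keeps the web UI rendering intact as step 2 requires.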
|
||||
|
||||
**Warning signs:**
|
||||
- TTS reads "asterisk asterisk important asterisk asterisk" instead of "important"
|
||||
- TTS reads "hash" or "pound" characters from Markdown headers
|
||||
- Agent prefix brackets audible in playback
|
||||
- Code block content being read aloud character by character
|
||||
|
||||
**Phase to address:** Phase 2 (Piper TTS pipeline + dual output pattern) — `sanitizeForTTS` must exist before the first TTS integration test.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 36: CPU-Only Whisper Model Size Too Large for Acceptable Latency
|
||||
|
||||
**What goes wrong:** The Whisper model family spans: tiny (39M params, ~75MB), base (74M, ~140MB), small (244M, ~470MB), medium (769M, ~1.5GB), large (1.5B, ~3GB). On Apple Silicon M4 with Metal/MLX acceleration, the medium model runs in under 1 second for typical voice input. On CPU-only fallback (or if MLX is not configured), medium-model transcription takes 4–15 seconds for a 10-second clip, which is too slow for interactive voice use.
|
||||
|
||||
Developers test on the primary deployment target (M4 with MLX fast path) and set `model: "medium"` as the default. On any other machine (CI server, Docker container, Linux server without Metal), the same default makes the feature unusable.
|
||||
|
||||
**Why it happens:** The bottleneck is hardware-dependent and only surfaces when Metal/MLX is unavailable. The test environment is the Mac Mini M4 where everything is fast.
|
||||
|
||||
**How to avoid:**
|
||||
1. Make the Whisper model size configurable at startup, not hardcoded. Default to `small` (good accuracy, fast on CPU), allow upgrade to `medium` or `large` in config.
|
||||
2. Add hardware detection to the STT sidecar startup: if Apple Silicon + MLX available, default to `medium`; if CPU-only, default to `small` or `tiny`.
|
||||
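3. Benchmark the chosen model on the target hardware before committing to it: `time python3 -c "from faster_whisper import WhisperModel; m = WhisperModel('small'); segments, _ = m.transcribe('test.wav'); print(list(segments))"`. Note that `transcribe` returns a lazy segment generator plus an info object; the segments must be iterated, otherwise no transcription work is actually timed.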
|
||||
4. For the Mac Mini M4 specifically: `mlx-whisper` or `whisper-mlx` uses Apple's MLX framework and is 2–8x faster than faster-whisper's CPU path, and does not require CUDA.
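
The selection logic in steps 1 and 2 could look roughly like the sketch below. `HardwareInfo`, `pickWhisperModel`, and the `mlxAvailable` flag are hypothetical; how MLX availability is actually probed (for example by the sidecar trying `import mlx`) is left as an assumption.

```typescript
// Hypothetical model selection: explicit config wins, otherwise pick by hardware.
type WhisperModel = "tiny" | "base" | "small" | "medium" | "large";

interface HardwareInfo {
  platform: string;      // e.g. process.platform
  arch: string;          // e.g. process.arch
  mlxAvailable: boolean; // assumed to be probed by the STT sidecar at startup
}

const MODELS: readonly string[] = ["tiny", "base", "small", "medium", "large"];

export function pickWhisperModel(hw: HardwareInfo, override?: string): WhisperModel {
  if (override !== undefined) {
    // Reject typos loudly instead of silently falling back.
    if (MODELS.indexOf(override) === -1) {
      throw new Error(`Unknown Whisper model: ${override}`);
    }
    return override as WhisperModel;
  }
  const appleSilicon = hw.platform === "darwin" && hw.arch === "arm64";
  // Fast path only when MLX is really usable; CPU-only hosts get the safer default.
  return appleSilicon && hw.mlxAvailable ? "medium" : "small";
}
```

Keeping the decision in one pure function makes the CPU fallback path testable on the M4 itself, without needing a second machine.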
|
||||
|
||||
**Warning signs:**
|
||||
- Transcription taking 5+ seconds for a 5-second voice clip
|
||||
- Default model is `medium` or `large` without hardware detection
|
||||
- MLX not installed or not used (check: `python3 -c "import mlx"` should succeed on M4)
|
||||
- STT latency acceptable on the dev machine but reported as "frozen" on other hardware
|
||||
|
||||
**Phase to address:** Phase 1 (Whisper STT pipeline + CPU fallback) — model selection logic and hardware detection must be in place before the latency target (<3 seconds) is validated.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 37: Telegram File Downloads Blocking the Bot Event Loop
|
||||
|
||||
**What goes wrong:** When a Telegram voice message arrives, the bot must: (1) call `getFile` to get the file path, (2) download the file from Telegram's CDN, (3) transcode with ffmpeg, (4) transcribe with Whisper. Steps 2–4 each take 0.5–3 seconds. If the bot processes messages synchronously in the main event loop, it cannot acknowledge incoming updates during this window. Telegram resends unacknowledged updates after a timeout, causing the bot to process the same voice message multiple times and flood the agent with duplicate transcriptions.
|
||||
|
||||
**Why it happens:** Bot frameworks (Telegraf, grammY) handle one update at a time by default. Voice message handling is I/O-heavy. The simple implementation puts all processing in the message handler, which blocks the next update from being processed.
|
||||
|
||||
**How to avoid:**
|
||||
1. Acknowledge the Telegram update immediately (return from the handler without awaiting the full pipeline). Kick off the transcription + agent call pipeline asynchronously.
|
||||
2. Use a per-chat-ID in-flight tracker: if a voice transcription is already in progress for a given `chatId`, queue the next one rather than spawning a second concurrent pipeline.
|
||||
3. Send an intermediate "Transcribing..." status message to Telegram immediately after receiving the voice message, so the user gets immediate feedback while the pipeline runs.
|
||||
4. Set a timeout on the ffmpeg + Whisper steps. If transcription takes longer than 30 seconds, send an error reply and discard the audio.
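
One way to combine steps 1, 2, and 4 is a per-chat promise chain plus a timeout wrapper. This is a sketch under assumptions: `PerChatQueue` and `withTimeout` are hypothetical helpers, and the timeout only stops waiting; it does not kill an ffmpeg or Whisper process that is still running.

```typescript
// Serializes background work per chat so the update handler can return at once.
export class PerChatQueue {
  private tails = new Map<number, Promise<void>>();

  // Returns immediately; the task starts after earlier tasks for this chat settle.
  enqueue(chatId: number, task: () => Promise<void>, onError: (e: unknown) => void): void {
    const prev = this.tails.get(chatId) ?? Promise.resolve();
    const next: Promise<void> = prev
      .then(task)
      .catch(onError)
      .then(() => {
        // Drop the entry once the last queued task for this chat is done.
        if (this.tails.get(chatId) === next) this.tails.delete(chatId);
      });
    this.tails.set(chatId, next);
  }

  isBusy(chatId: number): boolean {
    return this.tails.has(chatId);
  }
}

// Rejects if the wrapped promise does not settle within `ms` milliseconds.
export function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}
```

A voice handler would then call something like `queue.enqueue(chatId, () => withTimeout(transcribeAndReply(msg), 30_000), reportError)` and return without awaiting, where `transcribeAndReply` and `reportError` stand in for the real pipeline and error reply.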
|
||||
|
||||
**Warning signs:**
|
||||
- Same Telegram voice message transcribed 2–3 times (duplicate update delivery)
|
||||
- Bot stops responding to text messages while a voice message is being processed
|
||||
- Telegram delivery reports showing retried updates
|
||||
- `getFile` + `downloadFile` calls in the main event handler (not in a background task)
|
||||
|
||||
**Phase to address:** Phase 3 (Telegram bridge) — async pipeline architecture must be in place before end-to-end testing with real voice messages.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 38: Piper Binary Not Found When Node.js Server Starts as a Service
|
||||
|
||||
**What goes wrong:** `piper` is installed to a user directory like `~/.local/share/piper-tts/piper` or `/usr/local/bin/piper`. When the Node.js server runs interactively in a terminal, the shell PATH includes this directory. When the server starts via a system service (launchd on macOS, systemd on Linux), the service environment has a minimal PATH that does not include user-local directories. `child_process.spawn('piper', ...)` then fails with `ENOENT: no such file or directory`, reported on the child process's `error` event rather than as a thrown exception, so an unhandled failure is easy to miss.
|
||||
|
||||
This is a common and non-obvious failure: the feature works in development (interactive terminal) and silently fails in production (service startup).
|
||||
|
||||
**Why it happens:** Service environment PATH is not the same as interactive shell PATH. This is a standard UNIX gotcha that every server deployment eventually encounters.
|
||||
|
||||
**How to avoid:**
|
||||
1. Never rely on PATH resolution for subprocess binaries in server code. Store the absolute path to the Piper binary in the Nexus config file and use it explicitly in `spawn()`: `spawn('/usr/local/bin/piper', ...)`.
|
||||
2. Check for the binary at server startup and log its absolute path: `which piper || echo "piper not found in PATH"` in the startup health check.
|
||||
3. Add a `voices.piper_binary_path` config key that can be overridden in `~/.paperclip/nexus.yaml` without code changes.
|
||||
4. The same issue applies to `ffmpeg`. Both must be resolved to absolute paths.
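
A startup check along the lines of steps 1 to 4 can be sketched as follows. `resolveBinary` is a hypothetical helper; the candidate list would come from the `voices.piper_binary_path` config key plus whatever install locations the deployment actually uses.

```typescript
import { accessSync, constants } from "fs";
import { isAbsolute } from "path";

// Returns the first candidate that is an absolute path to an executable file.
// Relative names are skipped on purpose: the point is to never depend on PATH.
export function resolveBinary(name: string, candidates: string[]): string {
  for (const candidate of candidates) {
    if (!isAbsolute(candidate)) continue;
    try {
      accessSync(candidate, constants.X_OK); // throws if missing or not executable
      return candidate;
    } catch {
      // try the next candidate
    }
  }
  throw new Error(`${name} not found; checked: ${candidates.join(", ")}`);
}
```

Calling this once at boot for both `piper` and `ffmpeg`, logging the result, and passing the returned path to `spawn()` turns the service-only `ENOENT` into a clear startup error.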
|
||||
|
||||
**Warning signs:**
|
||||
- TTS works when running `pnpm dev` but fails when running via launchctl/systemctl
|
||||
- `ENOENT` errors in server logs for `piper` or `ffmpeg` processes
|
||||
- `process.env.PATH` in server context is shorter than interactive shell PATH
|
||||
|
||||
**Phase to address:** Phase 2 (Piper TTS pipeline) — absolute path configuration must be in place before any service deployment testing.
|
||||
|
||||
---
|
||||
|
||||
## Moderate Pitfalls (v1.6)
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 39: Dual Output Pattern Producing Two Separate LLM Calls Per Voice Message
|
||||
|
||||
**What goes wrong:** The dual output pattern (voice-optimized response + full text with code blocks) is straightforward to implement as two separate LLM calls: one with a "respond in plain spoken prose" system prompt for TTS, one with the standard formatting for display. But two calls per voice message doubles cost and doubles latency. For a local Hermes/Ollama backend, this doubles the time-to-response.
|
||||
|
||||
An alternative (one call, structured output) requires the LLM to produce a JSON object with `{ voice: "...", text: "..." }`. This requires reliable structured output, which is model-dependent — smaller models (7B) produce malformed JSON under structured output constraints more often than larger ones.
|
||||
|
||||
**Why it happens:** Two-call is the obvious, correct first implementation. The optimization is non-trivial and model-dependent.
|
||||
|
||||
**How to avoid:**
|
||||
1. For the MVP, use a single LLM call. Ask the agent to produce a voice-formatted response (plain prose, no markdown). Display the voice-formatted text in the chat UI as well — users reading the chat still get the content, just formatted for voice.
|
||||
2. Reserve the dual-output (voice prose + full-text-with-code) pattern for a later iteration when the voice pipeline is stable and the cost/latency of two calls is measurable.
|
||||
3. If dual output is required from the start: use function calling / tool use to get structured output rather than relying on JSON in the completion text. Most current models support structured output via the tools API more reliably than via raw JSON generation.
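
If the single-call structured variant is attempted, the parsing side should tolerate malformed output from smaller models. The sketch below assumes the `{ voice, text }` shape mentioned above; `parseDualOutput` is a hypothetical helper and the fallback policy (reuse the raw completion for both channels) is one possible choice, not a requirement.

```typescript
export interface DualOutput {
  voice: string; // prose intended for TTS
  text: string;  // full response for display
}

// Accepts the model's raw completion; never throws on malformed JSON.
export function parseDualOutput(raw: string): DualOutput {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed === "object" && parsed !== null) {
      const { voice, text } = parsed as Record<string, unknown>;
      if (typeof voice === "string" && typeof text === "string") {
        return { voice, text };
      }
    }
  } catch {
    // malformed JSON: fall through to the single-output fallback
  }
  return { voice: raw, text: raw };
}
```

With this fallback, a malformed response degrades to the MVP behaviour from step 1 instead of failing the whole voice turn.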
|
||||
|
||||
**Warning signs:**
|
||||
- Two sequential LLM calls observed in the server log per voice-flagged message
|
||||
- Latency in voice mode is 2× text mode latency
|
||||
- Structured output JSON malformed ~10% of the time on the 7B model
|
||||
|
||||
**Phase to address:** Phase 1 (Voice mode flag + dual output pattern) — design the output contract before implementation to avoid a costly rewrite.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 40: Audio Playback Autoplay Blocked by Browser Policy
|
||||
|
||||
**What goes wrong:** Browsers block `audio.play()` calls that are not triggered by a user gesture. The voice pipeline flow is: user records → server transcribes → agent responds → server synthesizes → client receives audio blob → client plays. The final `audio.play()` call is triggered by an SSE event or fetch response completion, not by a user gesture. Chrome and Safari block this as autoplay.
|
||||
|
||||
The feature appears to work in development (because the developer's browser has granted the page autoplay permissions during testing) and fails for first-time users on a clean browser profile.
|
||||
|
||||
**Why it happens:** Autoplay policies protect users from unexpected audio. Developers habitually run with autoplay unlocked in their dev browsers.
|
||||
|
||||
**How to avoid:**
|
||||
1. Require an explicit user gesture to initiate the voice mode session. The "start voice mode" button click counts as a user gesture — use it to create and unlock an `AudioContext` (`const ctx = new AudioContext(); await ctx.resume()`). Once unlocked, the AudioContext can play audio without further gesture requirements for the remainder of the session.
|
||||
2. Do not use `<audio>` element autoplay. Instead, decode the received audio blob with `AudioContext.decodeAudioData()` and play via `AudioBufferSourceNode` — this uses the already-unlocked context.
|
||||
3. Test on a clean browser profile with default settings to verify autoplay works before shipping.
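
The unlock-then-play flow from steps 1 and 2 can be sketched as below. `VoicePlayer` and the `AudioContextLike` interface are hypothetical; the interface lists only the Web Audio members the sketch uses, so a real `AudioContext` can be passed in the browser and a fake can be used in tests.

```typescript
// Minimal structural view of the Web Audio API members used here.
interface SourceLike {
  buffer: unknown;
  connect(destination: unknown): unknown;
  start(): void;
}
interface AudioContextLike {
  readonly state: string;
  readonly destination: unknown;
  resume(): Promise<void>;
  decodeAudioData(data: ArrayBuffer): Promise<unknown>;
  createBufferSource(): SourceLike;
}

export class VoicePlayer {
  private ctx: AudioContextLike | null = null;

  constructor(private readonly createContext: () => AudioContextLike) {}

  // Call from the "start voice mode" click handler so it counts as a user gesture.
  async unlock(): Promise<void> {
    if (this.ctx === null) this.ctx = this.createContext();
    const ctx = this.ctx;
    if (ctx.state === "suspended") await ctx.resume();
  }

  // Safe to call later from an SSE or fetch callback once unlock() has run.
  async play(audio: ArrayBuffer): Promise<void> {
    const ctx = this.ctx;
    if (ctx === null || ctx.state !== "running") {
      throw new Error("audio locked: call unlock() from a user gesture first");
    }
    const source = ctx.createBufferSource();
    source.buffer = await ctx.decodeAudioData(audio);
    source.connect(ctx.destination);
    source.start();
  }
}
```

In the browser this would be constructed as `new VoicePlayer(() => new AudioContext())`; the thrown error in `play` is the hook for showing a "tap to enable audio" prompt.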
|
||||
|
||||
**Warning signs:**
|
||||
- Audio plays fine in development but silently fails on first user visit
|
||||
- Browser DevTools console shows "play() failed because the user didn't interact with the document first"
|
||||
- `AudioContext` state is `suspended` when audio playback is attempted
|
||||
|
||||
**Phase to address:** Phase 3 (Web chat audio playback) — AudioContext unlock must be part of the "start voice mode" button handler design.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 41: Telegram Bot Token Stored in Environment Variable That Leaks Into Client Bundle
|
||||
|
||||
**What goes wrong:** The Telegram bot token (`TELEGRAM_BOT_TOKEN`) is a server-side secret. In a Vite/React monorepo, environment variables prefixed with `VITE_` are bundled into the client. A developer who adds `VITE_TELEGRAM_BOT_TOKEN` to expose it to React code, or who imports a `.env` file in a Vite config context, risks the token appearing in the compiled JS bundle served to the browser.
|
||||
|
||||
Even without the `VITE_` prefix, if the token value ends up in a shared `packages/shared` module that is imported by both server and client code, the bundler can include it in the client bundle. Tree-shaking is an optimization, not a security boundary.
|
||||
|
||||
**Why it happens:** The boundary between server and client code in a monorepo's shared packages is not enforced by Vite's environment variable system. The `VITE_` prefix is the documented mechanism for exposing variables to the client, but developers sometimes work around it.
|
||||
|
||||
**How to avoid:**
|
||||
1. Store `TELEGRAM_BOT_TOKEN` in `.env.server` (not `.env` or `.env.local` which Vite reads). Use `dotenv` on the server explicitly; never load this file through Vite.
|
||||
2. Validate at build time: add a lint check or Vite plugin that fails the build if any variable containing `TOKEN`, `SECRET`, or `KEY` appears in the client bundle.
|
||||
3. Keep Telegram bridge code entirely in `server/src/` — never in `packages/shared/` or any package imported by the UI.
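
The build-time validation in step 2 could start as a simple scan of the built client files. The sketch below is an assumption about how such a check might look, not an existing Vite plugin: `findLeaks` is a hypothetical helper that takes the bundle text and the secret values known to the build environment.

```typescript
// Returns human-readable findings; an empty array means the bundle looks clean.
export function findLeaks(bundle: string, secretValues: string[]): string[] {
  const leaks: string[] = [];
  for (const value of secretValues) {
    // Literal secret value embedded in client code.
    if (value.length > 0 && bundle.indexOf(value) !== -1) {
      leaks.push("secret value present in bundle");
    }
  }
  // Client-exposed variable names that look like credentials.
  const names = bundle.match(/VITE_[A-Z0-9_]*(?:TOKEN|SECRET|KEY)[A-Z0-9_]*/g) ?? [];
  for (const name of names) {
    leaks.push(`suspicious variable name: ${name}`);
  }
  return leaks;
}
```

Because Vite inlines `import.meta.env` values at build time, the value check is the one that matters for built output; the name check is more useful when run over `ui/src` sources.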
|
||||
|
||||
**Warning signs:**
|
||||
- `TELEGRAM_BOT_TOKEN` value visible in browser DevTools → Sources → compiled JS
|
||||
- `.env` file containing `TELEGRAM_BOT_TOKEN` in the repo root (Vite reads this)
|
||||
- Telegram bridge code imported from a shared package used by UI code
|
||||
|
||||
**Phase to address:** Phase 1 (Telegram bridge setup) — secret handling policy must be validated before the bot token is added to any config file.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 42: Voice Waveform UI Causing Unnecessary Re-renders During Recording
|
||||
|
||||
**What goes wrong:** The recording waveform visualization reads audio amplitude data from an `AnalyserNode` via `requestAnimationFrame` at ~60fps. If this data is stored in React state (`useState`), every frame triggers a re-render of the component tree above the waveform. For a chat interface with many messages, this causes perceptible jank during recording (dropped frames, slow scrolling).
|
||||
|
||||
**Why it happens:** Waveform amplitude data is time-series state that changes at animation frame rate. React state is not designed for 60fps updates. The trap is copy-pasting a CanvasRenderingContext2D waveform example that stores amplitude in `useState` without considering the re-render cost.
|
||||
|
||||
**How to avoid:**
|
||||
1. Read amplitude data via `useRef` + direct Canvas 2D drawing inside the `requestAnimationFrame` loop. Never put waveform data in `useState`.
|
||||
2. Keep the waveform Canvas element isolated from the React component tree — render it outside the main message list DOM subtree (e.g., as an absolutely positioned overlay) so re-draws do not trigger layout recalculation for sibling components.
|
||||
3. Stop the `requestAnimationFrame` loop and disconnect the `AnalyserNode` immediately when recording stops — do not leave the loop running even at low amplitude.
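
The per-frame work in step 1 reduces to a pure computation over the analyser's byte buffer, which is what makes it safe to keep out of React state. The sketch below assumes the buffer comes from `AnalyserNode.getByteTimeDomainData`, where 128 represents silence; `rmsAmplitude` is a hypothetical helper.

```typescript
// Root-mean-square amplitude in [0, 1) from unsigned 8-bit time-domain samples.
export function rmsAmplitude(samples: Uint8Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const centered = (samples[i] - 128) / 128; // 128 is the zero line
    sumSquares += centered * centered;
  }
  return Math.sqrt(sumSquares / samples.length);
}
```

Inside the `requestAnimationFrame` callback, the result would be written to a `useRef` value and drawn straight onto the canvas, with no state setter involved.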
|
||||
|
||||
**Warning signs:**
|
||||
- React DevTools Profiler shows high commit count and component renders during recording
|
||||
- Chat scroll is janky while recording
|
||||
- `requestAnimationFrame` callback showing `useState` setter calls
|
||||
|
||||
**Phase to address:** Phase 2 (Web chat mic button + waveform UI) — Canvas-direct rendering pattern must be established before the waveform component is built.
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 43: Missing ffmpeg on Production Mac Mini Silently Disabling Voice
|
||||
|
||||
**What goes wrong:** The entire STT + Telegram audio pipeline depends on ffmpeg for audio format conversion. If ffmpeg is not installed, the transcode step fails. Depending on error handling: (a) the transcription endpoint returns an HTTP 500, (b) it returns an empty transcription, or (c) it silently discards the audio and moves on. Outcomes (b) and (c) are worse than (a) because the user sees no error.
|
||||
|
||||
ffmpeg is not installed by default on macOS. It is available via Homebrew (`brew install ffmpeg`) but is not a Node.js dependency and will not be installed by `pnpm install`.
|
||||
|
||||
**Why it happens:** ffmpeg is a system dependency, not an npm dependency. It is easy to forget to document, and installation instructions are frequently missing from setup guides.
|
||||
|
||||
**How to avoid:**
|
||||
1. Add a startup check to the Nexus server: detect ffmpeg at boot time and log its version. If absent, log a prominent warning and disable the voice pipeline gracefully (return a clear error from the transcription endpoint, show a "voice unavailable" state in the UI).
|
||||
2. Add ffmpeg installation to the `npx buildthis` setup flow — if voice mode is enabled and ffmpeg is absent, the CLI should prompt to install it (`brew install ffmpeg`).
|
||||
3. Document ffmpeg as a hard prerequisite for voice features in the onboarding hardware detection step.
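
The startup check in step 1 can be sketched with `spawnSync` and a small parser for the first line of `ffmpeg -version`. `detectFfmpeg` and `parseFfmpegVersion` are hypothetical helpers; the binary path is passed in explicitly so the check does not depend on the service's PATH.

```typescript
import { spawnSync } from "child_process";

// Extracts the version token from output such as "ffmpeg version 6.1.1 Copyright ...".
export function parseFfmpegVersion(stdout: string): string | null {
  const match = /^ffmpeg version (\S+)/.exec(stdout);
  return match?.[1] ?? null;
}

// Runs `<binary> -version`; a missing binary yields available: false, not an exception.
export function detectFfmpeg(binary: string): { available: boolean; version: string | null } {
  const result = spawnSync(binary, ["-version"], { encoding: "utf8" });
  if (result.error || result.status !== 0) {
    return { available: false, version: null };
  }
  const version = parseFfmpegVersion(result.stdout);
  return { available: version !== null, version };
}
```

The server can log the detected version at boot and, when `available` is false, have the transcription endpoint return an explicit "voice unavailable" error rather than an empty transcription.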
|
||||
|
||||
**Warning signs:**
|
||||
- `which ffmpeg` returns nothing on the production machine
|
||||
- Voice features work in development (developer has ffmpeg) but fail in any fresh install
|
||||
- Transcription endpoint returning 500 with no diagnostic message
|
||||
|
||||
**Phase to address:** Phase 1 (Whisper STT pipeline) — ffmpeg detection and graceful degradation must be implemented before any voice endpoint is exposed.
|
||||
|
||||
---
|
||||
|
||||
## Minor Pitfalls (v1.6)
|
||||
|
||||
---
|
||||
|
||||
### Pitfall 44: Telegram Agent Prefix Leaking Into Whisper Transcription Input
|
||||
|
||||
**What goes wrong:** The Telegram bridge formats replies as `[AgentName] response text`. If the bridge accidentally echoes the agent's own message back into the Whisper transcription pipeline (e.g., when relaying between agents or logging), Whisper transcribes the agent prefix along with the user's intended input. The resulting transcription contains `[Nexus] previous response...` prepended to whatever the user said. The agent receives this as its next input and behaves erratically.
|
||||
|
||||
**Why it happens:** Message relay and logging code passes message objects through the same pipeline as user input without filtering by sender type.
|
||||
|
||||
**How to avoid:**
|
||||
1. In the Telegram bridge handler, only transcribe messages where `update.message.from.id !== bot.id` — never transcribe messages sent by the bot itself.
|
||||
2. Apply a sender-type check before the transcription pipeline: if the message is from a bot, skip transcription and routing entirely.
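
Both checks can live in one guard that runs before any download or transcription. The sketch below uses the Bot API field names `from.id`, `from.is_bot`, `chat.id`, and `voice`; `shouldTranscribe` itself and the allow-list parameter are hypothetical, with the allow-list reflecting the chat ID whitelist recommended in the security table.

```typescript
// Subset of the Telegram Message object used by the guard.
interface IncomingMessage {
  from?: { id: number; is_bot: boolean };
  chat: { id: number };
  voice?: { file_id: string };
}

// True only for voice messages from a human sender in an allowed chat.
export function shouldTranscribe(
  msg: IncomingMessage,
  botId: number,
  allowedChatIds: readonly number[],
): boolean {
  if (msg.voice === undefined) return false;                  // not a voice message
  if (msg.from === undefined) return false;                   // no sender information
  if (msg.from.is_bot || msg.from.id === botId) return false; // never transcribe bot output
  return allowedChatIds.indexOf(msg.chat.id) !== -1;          // chat must be whitelisted
}
```

Putting the guard at the top of the handler means relay and logging code paths cannot feed the bot's own messages into Whisper by accident.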
|
||||
|
||||
**Phase to address:** Phase 3 (Telegram bridge) — sender-type filtering must be in the handler before end-to-end testing.
|
||||
|
||||
---
|
||||
|
||||
## Technical Debt Patterns
|
||||
|
||||
| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|
||||
|
|
@ -504,6 +929,11 @@ The existing code in `getRecommendedModel()` silently skips models not in the ca
|
|||
| Inline Piper model download on first TTS call | Zero extra onboarding step | Silent hang on first use; poor UX; perceived as broken feature | Never; always pre-warm |
|
||||
| Flat memory injection (all memories into every prompt) | Simple implementation | Context window overflow; irrelevant memories degrade response quality | Only for prototyping |
|
||||
| No mode discriminator on conversations table | No schema change needed | Mode cross-contamination; hard to query assistant vs. agent conversations | Acceptable with explicit agent-based filtering |
|
||||
| Spawn new Piper process per TTS request | Trivial first implementation | 200–800ms model reload per request; long-text truncation bug | Never; use persistent process |
|
||||
| Skip ffmpeg transcode, send raw audio to Whisper | One less dependency | Silent transcription failures on non-WAV formats; broken on Safari/Firefox | Never; transcode is mandatory |
|
||||
| Whisper in-process (Python subprocess per request) | No sidecar to manage | Model reload on every call; memory leak; concurrent request memory doubling | Only for one-off scripts, never for server |
|
||||
| Telegram webhook on local server | "Production-grade" pattern | Requires public URL; breaks behind NAT; doesn't work for this deployment | Never for local Mac Mini deployment |
|
||||
| `useState` for waveform animation data | Familiar React pattern | 60fps state updates cause continuous re-renders; UI jank during recording | Never; use `useRef` + Canvas direct |
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -518,6 +948,13 @@ The existing code in `getRecommendedModel()` silently skips models not in the ca
|
|||
| Google OAuth | Store access token in `localStorage` | Exchange code server-side; store token in Paperclip secrets; never expose to browser |
|
||||
| Upstream rebase after v1.5 | Forget to diff `OnboardingWizard.tsx` against upstream | Post-rebase protocol: diff the aliased-away file, integrate any new upstream props |
|
||||
| Apple Silicon VRAM | Report `os.totalmem()` as available GPU memory | Use `os.freemem()` with explicit copy: "unified memory, shared with OS" |
|
||||
| Whisper STT (server-side) | Pass raw browser WebM to Whisper | Transcode to 16 kHz mono WAV via ffmpeg first; Whisper expects PCM WAV |
|
||||
| Telegram voice messages | Assume same pipeline as browser audio | Telegram sends OGG/Opus at 48 kHz; same ffmpeg transcode step applies but source is CDN download |
|
||||
| Piper TTS (server-side binary) | Spawn new process per request | Keep Piper as persistent HTTP sidecar; model stays loaded between requests |
|
||||
| Telegram bot updates | Use webhooks for local deployment | Use long polling (`getUpdates`) — works behind NAT, no public URL required |
|
||||
| Telegram bot token | Add `VITE_TELEGRAM_BOT_TOKEN` for debugging | Keep token server-side only; never in Vite env variables |
|
||||
| Audio autoplay (browser) | Call `audio.play()` from SSE event handler | Unlock `AudioContext` on the "start voice mode" gesture; play via AudioBufferSourceNode |
|
||||
| ffmpeg dependency | Assume it is installed | Detect at server startup; degrade gracefully with clear error; add to `npx buildthis` setup |
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -530,6 +967,11 @@ The existing code in `getRecommendedModel()` silently skips models not in the ca
|
|||
| Piper TTS blocking main thread | UI freezes during synthesis | Run Piper WASM in a Web Worker; stream audio chunks as they generate | Models larger than small/medium quality |
|
||||
| Ollama model catalog loaded from disk on every request | File I/O on every recommendation call | Load and cache catalog at server startup, not per-request | High-frequency polling during onboarding |
|
||||
| MCP tool calls in the critical path of assistant response | Latency spikes when memory server is slow | Make MCP tool calls non-blocking where possible; set aggressive timeouts | MCP server under load or starting up |
|
||||
| Whisper model reload per STT request | 500MB+ memory spike; 2–5s startup delay per transcription | Persistent sidecar process; model loaded once at startup | First concurrent request pair |
|
||||
| Piper process spawn per TTS request | 200–800ms model reload per voice response | Persistent Piper process or HTTP sidecar | Any production traffic |
|
||||
| Telegram file download in main bot handler | Bot stops processing messages during download | Download + transcode + transcribe in async background task | Any voice message >2 seconds |
|
||||
| 60fps waveform data in React state | Chat UI jank during recording | Canvas-direct rendering via `useRef`, no React state for amplitude data | Any component tree with >20 chat messages |
|
||||
| No request queue on Whisper sidecar | Memory doubles under concurrent requests | Semaphore pattern; max 1 concurrent transcription | 2+ simultaneous voice inputs |
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -542,6 +984,9 @@ The existing code in `getRecommendedModel()` silently skips models not in the ca
|
|||
| Unauthenticated MCP endpoint exposure | External callers invoking memory read/write | MCP server bound to `localhost` only; board auth required for all tool calls |
|
||||
| Puter.js API key in browser bundle | Key exposure in DevTools | Server-side Puter adapter; no Puter credentials in browser |
|
||||
| Recording audio without explicit per-session consent indicator | Privacy violation perception | Show persistent recording indicator; stop all audio tracks immediately on stop |
|
||||
| `VITE_TELEGRAM_BOT_TOKEN` in environment | Token bundled into client JS; visible in DevTools | Server-only env vars for all bot tokens; no `VITE_` prefix for secrets |
|
||||
| Telegram bridge accepting messages from any chat ID | Unauthorized users can send commands to agent | Whitelist allowed `chatId` values in config; reject all other chat IDs |
|
||||
| Audio files persisted to disk without cleanup | Disk space exhaustion; audio data retained longer than needed | Delete transcoded WAV files immediately after Whisper transcription |
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -555,6 +1000,10 @@ The existing code in `getRecommendedModel()` silently skips models not in the ca
|
|||
| Hardware detection with false confidence | User loads model that OOMs | Label recommendations as "estimated" not "guaranteed"; add safety margin |
|
||||
| Mode selection before hardware detection | User picks "Personal AI Assistant" but their hardware can't run local models | Show hardware detection first; mode recommendation follows hardware capability |
|
||||
| Summary screen with no way to change a step | User made wrong choice earlier; stuck | Every summary item links back to the relevant step |
|
||||
| No intermediate "transcribing..." feedback on Telegram | User resends voice message thinking it was lost | Send immediate typing indicator + "Transcribing..." message to Telegram |
|
||||
| Voice auto-stop firing mid-sentence | Partial input submitted; confusing agent response | Use VAD library (Silero/`@ricky0123/vad-web`) not threshold detection; add manual stop button |
|
||||
| TTS reading agent prefix brackets aloud | Robotic "bracket Nexus bracket" audio | `sanitizeForTTS()` strips all formatting before synthesis |
|
||||
| Autoplay blocked with no feedback | Audio response plays silently; user thinks voice is broken | Unlock AudioContext on voice mode toggle; show clear "tap to enable audio" prompt if blocked |
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -571,6 +1020,19 @@ The existing code in `getRecommendedModel()` silently skips models not in the ca
|
|||
- [ ] **Apple Silicon VRAM copy:** Does the hardware detection screen say "unified memory" not "VRAM" for M-series chips?
|
||||
- [ ] **`npx buildthis` package name:** Has `npm search buildthis` been run to verify no collision?
|
||||
- [ ] **Upstream OnboardingWizard diff:** After the v1.5 wizard is built, has `OnboardingWizard.tsx` been diffed against upstream to check for new props that `NexusOnboardingWizard.tsx` needs to handle?
|
||||
- [ ] **Audio format transcode:** Does the `/transcribe` endpoint transcode to 16 kHz mono WAV before passing to Whisper? Test with a Safari recording (mp4) and Firefox recording (ogg).
|
||||
- [ ] **Telegram OGG pipeline:** Is the Telegram voice download → ffmpeg → Whisper path tested with a real Telegram voice message (not a local file)?
|
||||
- [ ] **Piper persistent process:** Is Piper running as a persistent process/sidecar, not spawned per request? Check `ps aux | grep piper` count during consecutive TTS calls.
|
||||
- [ ] **Whisper sidecar health check:** Does the server wait for the Whisper sidecar `/health` endpoint before routing STT requests?
|
||||
- [ ] **Voice mode flag propagation:** Is `isVoiceMode` present in the agent's system prompt log? Check server logs for a voice-flagged message.
|
||||
- [ ] **TTS sanitization:** Does `sanitizeForTTS()` strip agent prefixes, Markdown, and issue IDs? Test with a response containing backtick code blocks.
|
||||
- [ ] **Telegram session routing:** After sending a Telegram message, does the reply appear in Telegram (not web UI)? Check session `channel` field in DB.
|
||||
- [ ] **Long polling only:** Is `deleteWebhook` called on bot startup to ensure no stale webhook is registered?
|
||||
- [ ] **AudioContext unlock:** Does audio autoplay work on a fresh browser profile (no stored autoplay permissions)?
|
||||
- [ ] **ffmpeg at startup:** Does the server log ffmpeg version on startup? Does it gracefully disable voice with a clear error if ffmpeg is absent?
|
||||
- [ ] **Telegram token not in client bundle:** Does `grep -r "VITE_TELEGRAM" ui/src` return nothing?
|
||||
- [ ] **Telegram chat ID whitelist:** Does the bot reject messages from unknown chat IDs?
|
||||
- [ ] **Audio file cleanup:** Are transcoded WAV temp files deleted after transcription?
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -585,6 +1047,13 @@ The existing code in `getRecommendedModel()` silently skips models not in the ca
|
|||
| Model catalog stale | LOW | Add fallback heuristic; document update process |
|
||||
| Onboarding probe auth-gated on board auth | MEDIUM | Add unauthenticated system/providers endpoint; update wizard to use new endpoint |
|
||||
| Mode contamination in conversations table | MEDIUM | Add agent-based filter to conversation queries; document the filtering convention |
|
||||
| Piper spawn-per-request shipped | MEDIUM | Wrap Piper in persistent HTTP sidecar; update spawn calls to HTTP requests; no data migration needed |
|
||||
| Whisper in-process (no sidecar) shipped | HIGH | Extract to standalone FastAPI/Flask sidecar; update all Node.js callers; retest on CPU fallback path |
|
||||
| Telegram webhook on local deploy | LOW | Call `deleteWebhook`; switch to `getUpdates` long polling; update bot startup code |
|
||||
| Telegram session channel overwritten | MEDIUM | Add dedicated `telegram` channel type; audit all `sessions_send` call sites; retest routing |
|
||||
| `VITE_TELEGRAM_BOT_TOKEN` in bundle shipped | HIGH | Rotate bot token immediately; move to server-only env var; rebuild and redeploy |
|
||||
| ffmpeg missing, voice silently broken | LOW | Install ffmpeg; add startup check to catch future regressions |
|
||||
| Audio autoplay blocked | LOW | Implement AudioContext unlock on voice mode toggle; test on clean browser profile |
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -607,6 +1076,24 @@ The existing code in `getRecommendedModel()` silently skips models not in the ca
|
|||
| Subscription detection false positives (24) | Phase 3 — Subscription Detection | Test: revoke an API key; verify wizard shows "unverified" not "ready" |
|
||||
| Project handoff losing context (25) | Phase 5 — Persistent Memory | Test: handoff includes conversation ID, not just flat text summary |
|
||||
| Model catalog staleness (26) | Phase 1 — Hardware Detection | Test: install an uncatalogued Ollama model; verify fallback heuristic fires |
|
||||
| Audio format mismatch browser → Whisper (27) | v1.6 Phase 1 — Whisper STT | Test: record on Safari + Firefox; verify both transcribe correctly |
|
||||
| Telegram OGG/Opus 48 kHz mismatch (28) | v1.6 Phase 3 — Telegram audio | Test: send real Telegram voice message; verify transcription succeeds |
|
||||
| Piper spawn-per-request (29) | v1.6 Phase 2 — Piper TTS | Verify: `ps aux \| grep piper` shows one persistent process, not N per request |
|
||||
| Whisper model reload per request (30) | v1.6 Phase 1 — Whisper sidecar | Verify: memory stays flat across 10 consecutive transcription requests |
|
||||
| Browser silence detection too eager/late (31) | v1.6 Phase 2 — Web mic button | Test: natural mid-sentence pause does not auto-submit; quiet room does not stall recording |
|
||||
| Voice mode flag not propagated (32) | v1.6 Phase 1 — Voice mode flag | Test: voice-flagged message; verify agent system prompt contains voice formatting instruction |
|
||||
| Telegram competing session identity (33) | v1.6 Phase 3 — Telegram session | Test: Telegram message reply arrives in Telegram, not web UI |
|
||||
| Telegram webhook on local deploy (34) | v1.6 Phase 1 — Telegram setup | Verify: `getWebhookInfo` returns empty webhook URL; bot uses long polling |
|
||||
| TTS synthesizing agent prefixes verbatim (35) | v1.6 Phase 2 — TTS sanitization | Test: agent reply with `[Nexus]` prefix; verify audio does not contain "bracket" |
|
||||
| Whisper model too large for CPU fallback (36) | v1.6 Phase 1 — CPU fallback | Benchmark: transcribe 10s clip on CPU-only path; must complete in <5s |
|
||||
| Telegram file download blocking event loop (37) | v1.6 Phase 3 — Telegram async | Test: send voice message; verify text messages still processed during download |
|
||||
| Piper binary not found in service PATH (38) | v1.6 Phase 2 — Piper binary config | Test: start server via launchctl; verify Piper path resolves |
|
||||
| Dual output two LLM calls doubling latency (39) | v1.6 Phase 1 — Output pattern design | Verify: single LLM call per voice message in server logs |
|
||||
| Audio autoplay blocked by browser policy (40) | v1.6 Phase 3 — Web audio playback | Test: fresh browser profile; voice response plays without user interaction after voice mode toggle |
|
||||
| Telegram bot token in client bundle (41) | v1.6 Phase 1 — Telegram setup | Verify: `grep -r TELEGRAM ui/dist` returns nothing |
|
||||
| Waveform causing React re-renders (42) | v1.6 Phase 2 — Waveform UI | Profile: React DevTools shows no re-renders in parent components during recording |
|
||||
| ffmpeg missing on production (43) | v1.6 Phase 1 — Whisper STT | Verify: server logs ffmpeg version on startup; `which ffmpeg` on production machine |
|
||||
| Telegram agent prefix in transcription input (44) | v1.6 Phase 3 — Telegram handler | Verify: bot-originated messages are filtered before the transcription pipeline |
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -630,7 +1117,19 @@ The existing code in `getRecommendedModel()` silently skips models not in the ca
|
|||
- [Piper TTS WASM cold start](https://github.com/rhasspy/piper/issues/352) — first-run download, OPFS caching, warmup pattern
|
||||
- [OAuth PKCE SPA Best Practices — Curity](https://curity.io/resources/learn/spa-best-practices/) — sessionStorage for verifiers, server-side token storage
|
||||
- [AI Agent Memory — Redis](https://redis.io/blog/ai-agent-memory-stateful-systems/) — context window overflow, hybrid vector+graph architecture
|
||||
- [faster-whisper audio format and WebM handling](https://github.com/SYSTRAN/faster-whisper/issues/671) — WebM input requires ffmpeg transcode to 16 kHz WAV (MEDIUM confidence)
|
||||
- [Whisper concurrent requests memory issue](https://github.com/SYSTRAN/faster-whisper/issues/1419) — each worker loads full model; use semaphore pattern (HIGH confidence — official issue tracker)
|
||||
- [Piper long message truncation bug](https://github.com/home-assistant/addons/issues/3360) — messages >500 chars cause silent truncation; chunk input (HIGH confidence — confirmed upstream bug)
|
||||
- [Piper persistent process discussion](https://github.com/rhasspy/piper/issues/376) — subprocess exits after each synthesis; keep alive pattern documented (HIGH confidence)
|
||||
- [Telegram webhooks guide — official](https://core.telegram.org/bots/webhooks) — port and SSL requirements; mixing polling/webhook forbidden (HIGH confidence — official docs)
|
||||
- [Telegram voice message OGG/Opus format](https://dev.to/techresolve/solved-convert-voice-memos-from-telegram-to-text-using-openai-whisper-api-41al) — OGG Opus 48 kHz; ffmpeg transcode required (MEDIUM confidence)
|
||||
- [Telegram session channel overwrite bug](https://github.com/openclaw/openclaw/issues/31671) — `sessions_send` overwrites `channel` field to `webchat` (HIGH confidence — confirmed bug in similar system)
|
||||
- [Silero VAD vs WebRTC VAD comparison](https://picovoice.ai/blog/best-voice-activity-detection-vad-2025/) — WebRTC VAD 50% TPR vs Silero 87.7% at 5% FPR (HIGH confidence — benchmark data)
|
||||
- [Whisper on Apple Silicon M4 benchmarks](https://dev.to/theinsyeds/whisper-speech-recognition-on-mac-m4-performance-analysis-and-benchmarks-2dlp) — all models process 10s audio faster than real-time on M4 with MLX (MEDIUM confidence)
|
||||
- [MediaRecorder cross-browser format differences](https://media-codings.com/articles/recording-cross-browser-compatible-media) — Chrome/Firefox/Safari produce different formats and bitrates (HIGH confidence)
|
||||
- [Node.js child_process binary PATH issue](https://github.com/nodejs/help/issues/163) — service environment PATH differs from interactive shell; use absolute paths (HIGH confidence)
|
||||
- [Whisper memory leak](https://github.com/openai/whisper/discussions/605) — RAM not fully released after transcription in some environments (MEDIUM confidence)
|
||||
|
||||
---
|
||||
*Pitfalls research for: Nexus v1.5 — Smart Onboarding + Personal AI Assistant*
|
||||
*Researched: 2026-04-02*
|
||||
*Pitfalls research for: Nexus v1.5 — Smart Onboarding + Personal AI Assistant; v1.6 — Voice Pipeline + Telegram Bridge*
|
||||
*Researched: 2026-04-02; Updated: 2026-04-03*
|
||||
|
|
|
|||
|
|
@ -1,395 +1,320 @@
|
|||
# Technology Stack: v1.5 Smart Onboarding + Personal AI Assistant
|
||||
# Technology Stack: v1.6 Voice Pipeline + Telegram Bridge
|
||||
|
||||
**Project:** Nexus v1.5 — additive to existing fork maintenance stack (see prior milestone research for branding/fork strategy)
|
||||
**Researched:** 2026-04-02
|
||||
**Scope:** NEW libraries only — Puter.js, hardware detection, Whisper STT + Piper TTS, OAuth, `npx buildthis` CLI, persistent memory
|
||||
**Confidence:** MEDIUM-HIGH (most verified via official docs; a few version numbers from npm search only)
|
||||
**Project:** Nexus v1.6 — additive to v1.5 stack (see prior STACK.md for hardware detection, smart-whisper, Puter.js, vectra, openid-client)
|
||||
**Researched:** 2026-04-03
|
||||
**Scope:** NEW libraries only for v1.6 — server-side voice pipeline integration, audio format conversion, browser VAD, Telegram bridge
|
||||
**Confidence:** MEDIUM-HIGH (grammy HIGH via official docs; vad-react MEDIUM — React 19 peer dep confirmed fixed; ffmpeg-static MEDIUM — archived fluent-ffmpeg confirmed, spawn approach verified)
|
||||
|
||||
---
|
||||
|
||||
## Existing Stack (Do Not Change)
|
||||
## Context: What v1.5 Already Installed
|
||||
|
||||
The following are already installed and working. Zero changes needed:
|
||||
Do not re-add or re-research these — they are in `server/package.json` or `ui/package.json`:
|
||||
|
||||
| Area | What's There | Location |
|
||||
|------|-------------|----------|
|
||||
| CLI framework | Commander.js `^13.1.0` + `@clack/prompts ^0.10.0` | `cli/package.json` |
|
||||
| Hardware/Ollama | Custom detection (`v1.4`) + `systeminformation` likely via existing adapter | `packages/adapters/hermes` |
|
||||
| Server auth | `better-auth 1.4.18` | `server/package.json` |
|
||||
| UI | React 19, Vite 6, Tailwind v4, TanStack Query v5 | `ui/package.json` |
|
||||
| DB | LibSQL/Drizzle ORM | `server/package.json` |
|
||||
| Package | Location | Purpose |
|
||||
|---------|----------|---------|
|
||||
| `smart-whisper ^0.8.1` | `server/` | Whisper.cpp Node bindings (recommended in v1.5 STACK.md) |
|
||||
| `@mintplex-labs/piper-tts-web ^1.0.4` | `ui/` | Browser-side Piper WASM (already installed) |
|
||||
| `systeminformation 5` | `server/` | Hardware detection |
|
||||
| `multer ^2.0.2` | `server/` | Multipart upload (already handles audio blob uploads) |
|
||||
| `express ^5.1.0` | `server/` | HTTP server |
|
||||
|
||||
The existing `VoiceRecordButton` already uses `MediaRecorder` + `POST /api/transcribe`. The existing `usePiperTts` hook already uses `@mintplex-labs/piper-tts-web` for browser-side TTS. The v1.6 work **extends** this — adding silence detection, server-side TTS, and Telegram relay.
|
||||
|
||||
---
|
||||
|
||||
## New Libraries by Feature Area
|
||||
|
||||
### 1. Puter.js — Zero-Config Cloud AI
|
||||
### 1. Browser VAD (Silence Detection + Auto-Send)
|
||||
|
||||
**Package:** `@heyputer/puter.js`
|
||||
**Version:** latest (no stable semver pinned on npm — use `@latest` and lock in pnpm-lock)
|
||||
**Where it lives:** `ui/` only — Puter.js is a frontend-first browser SDK
|
||||
**Package:** `@ricky0123/vad-react`
|
||||
**Version:** `^0.0.36`
|
||||
**Where it lives:** `ui/` only — browser-side ONNX model running off the main thread
|
||||
|
||||
**Why:** 500+ models (GPT-4o, Claude, Gemini, Grok, DeepSeek) with zero API keys and zero developer billing. Users authenticate with their own Puter account; usage cost falls on the user, not the developer. This is the project's "zero-config cloud" tier — the entire value prop depends on this library.
|
||||
**Why:** The existing `VoiceRecordButton` requires the user to manually tap Stop. `@ricky0123/vad-react` uses Silero VAD (ONNX Runtime Web) to detect when the user stops speaking and fires `onSpeechEnd` automatically with the speech segment as a `Float32Array` at 16kHz. This eliminates the manual stop button and enables waveform-while-speaking UI via the `userSpeaking` state flag.
|
||||
|
||||
**How the API works:**
|
||||
**React 19 compatibility:** Confirmed fixed in v0.0.36 (August 2025). The peer dependency constraint on React 18 was resolved. No `--legacy-peer-deps` needed.
|
||||
|
||||
**API surface:**
|
||||
|
||||
```typescript
|
||||
// Browser only — import via script tag or bundler
|
||||
import Puter from "@heyputer/puter.js";
|
||||
import { useMicVAD } from "@ricky0123/vad-react";
|
||||
|
||||
// Chat (streaming)
|
||||
const stream = await puter.ai.chat("Hello", {
|
||||
model: "gpt-4o",
|
||||
stream: true,
|
||||
const vad = useMicVAD({
|
||||
startOnLoad: false, // user must explicitly start
|
||||
positiveSpeechThreshold: 0.3, // sensitivity
|
||||
minSpeechMs: 400, // ignore sub-400ms blips
|
||||
redemptionMs: 1400, // 1.4s silence = end of utterance
|
||||
onSpeechEnd: (audio: Float32Array) => {
|
||||
// audio is 16kHz Float32Array — matches what Whisper expects
|
||||
sendToTranscribeEndpoint(float32ToWav(audio));
|
||||
},
|
||||
});
|
||||
for await (const part of stream) {
|
||||
process.stdout.write(part?.text ?? "");
|
||||
}
|
||||
|
||||
// Image generation, TTS, STT also available under puter.ai.*
|
||||
// vad.userSpeaking — boolean for waveform animation
|
||||
// vad.listening — boolean for mic state
|
||||
// vad.start() / vad.pause()
|
||||
```
|
||||
|
||||
**Integration point:** New `PuterAdapter` in `packages/adapters/` following the existing adapter pattern. The adapter wraps `puter.ai.chat()` and maps to the shared `AdapterMessage` type. Keep it display-layer only — no server-side Puter calls.
|
||||
**Key integration note:** `onSpeechEnd` delivers a `Float32Array` at 16000Hz — this maps directly to what `smart-whisper` expects on the server side, so no resampling is needed in the browser-to-server path.
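
The `float32ToWav` call in the snippet above is not defined anywhere in this document. A minimal sketch of such a helper follows, assuming mono input and 16-bit PCM output; the name and signature are hypothetical and are not part of `@ricky0123/vad-react`.

```typescript
// Hypothetical helper: wrap mono Float32 samples in a 16-bit PCM WAV container.
function float32ToWav(samples: Float32Array, sampleRate = 16000): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte canonical WAV header
  const view = new DataView(buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, 1, true); // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] before scaling so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(buffer);
}
```

In the browser the result could be wrapped in a `Blob` with type `audio/wav` and posted as multipart form data, which is the upload shape the existing `multer` route already handles.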
|
||||
|
||||
**Constraint:** Puter.js runs in browser context only. Do NOT add it to `server/` or `cli/`. The adapter must be a frontend-only workspace package or inlined into the UI.
|
||||
|
||||
**Confidence: HIGH** — Official docs verified at developer.puter.com. User-pays model confirmed.
|
||||
**Confidence: MEDIUM** — Version verified via GitHub issues, React 19 fix confirmed. ONNX Runtime Web dependency means an extra ~5MB WASM download on first load.
|
||||
|
||||
---
|
||||
|
||||
### 2. Hardware Detection — GPU, RAM, Apple Silicon
|
||||
### 2. Audio Format Conversion (Server-Side: WebM → WAV, WAV → OGG)
|
||||
|
||||
**Package:** `systeminformation`
|
||||
**Version:** `^5.31.5` (latest stable; v6 TypeScript rewrite is in progress but not released)
|
||||
**Where it lives:** `server/` (runs on the Mac Mini; browser APIs cannot access hardware)
|
||||
**Package:** `ffmpeg-static`
|
||||
**Version:** `^5.2.0` (bundles FFmpeg 6.1.1 binaries for macOS arm64 + x64, Linux, Windows)
|
||||
**Where it lives:** `server/` — provides the binary path; invoked via Node.js `child_process.spawn`
|
||||
|
||||
**Why:** The only comprehensive cross-platform system info library for Node.js with 20M+ monthly downloads. Covers CPU, total RAM, GPU model/VRAM, and Apple Silicon GPU core count — exactly what's needed for model recommendation. Alternatives (`detect-gpu`, `gpu-info`) are browser-only or Windows-only.
|
||||
**Why `ffmpeg-static` over alternatives:**
|
||||
- `fluent-ffmpeg` was archived on GitHub May 2025, no longer maintained — do NOT use as a new dependency
|
||||
- `@ffmpeg-installer/ffmpeg` — last updated 2022, stale binary (FFmpeg 4.x)
|
||||
- `ffmpeg-static` — actively maintained, ships FFmpeg 6.1.1, macOS arm64 confirmed, installed as an npm dependency (no system-level install needed)
|
||||
- Calling `child_process.spawn(ffmpegPath, [...])` directly, with `ffmpegPath` imported from `ffmpeg-static`, is the recommended approach for 2025+
|
||||
|
||||
**Key functions for v1.5:**
|
||||
**Two conversions needed:**
|
||||
|
||||
**a) Incoming STT path: WebM/Opus → WAV 16kHz mono (for Whisper)**
|
||||
|
||||
```typescript
|
||||
import si from "systeminformation";
|
||||
import ffmpegPath from "ffmpeg-static";
|
||||
import { spawn } from "node:child_process";
|
||||
|
||||
// Total system RAM
|
||||
const mem = await si.mem(); // mem.total in bytes
|
||||
|
||||
// GPU info — works on macOS, Windows, Linux
|
||||
const graphics = await si.graphics();
|
||||
// graphics.controllers[0].vram — VRAM in MB (dedicated GPU)
|
||||
// graphics.controllers[0].cores — GPU cores (Apple Silicon only)
|
||||
// graphics.controllers[0].model — e.g. "Apple M4 Pro"
|
||||
```
|
||||
|
||||
**Apple Silicon nuance:** Apple Silicon has unified memory — there is no separate VRAM. `si.graphics()` returns `vram: 0` and populates `cores` with GPU core count instead. The model recommendation logic must handle this: use `mem.total` as effective VRAM for Apple Silicon, scaled by a configurable fraction (typically 0.75 since OS+apps compete for the same pool).
|
||||
|
||||
**Existing usage in v1.4:** Ollama detection and RAM/VRAM recommendations are already implemented. This is an additive enhancement — if `systeminformation` is not yet imported in the server, add it. If it is, extend the existing detection service.
|
||||
|
||||
**Confidence: HIGH** — Verified via systeminformation.io official docs. Apple Silicon behavior confirmed via GPU core detection doc.
|
||||
|
||||
---
|
||||
|
||||
### 3. Whisper STT — Speech to Text (CPU-capable)
|
||||
|
||||
**Recommendation:** `smart-whisper`
|
||||
**Version:** `^0.8.1` (latest as of October 2025)
|
||||
**Where it lives:** `server/` as an optional service (graceful degradation if model not downloaded)
|
||||
|
||||
**Why over alternatives:**
|
||||
- `smart-whisper`: Native Node.js addon wrapping whisper.cpp directly. Supports loading one model for parallel inferences. Auto-enables Apple Neural Engine acceleration on macOS. Pre-built binaries for macOS arm64 (Mac Mini M4).
|
||||
- `nodejs-whisper` (v0.2.9, 10 months old): Older, CPU-focused, spawns a subprocess. Works but slower and less maintained.
|
||||
- `whisper-node` (v1.1.1, 2 years old): Abandoned.
|
||||
|
||||
**Model recommendation for Mac Mini M4:**
|
||||
- `base.en` model (~140MB) — good balance of speed/accuracy for English voice input
|
||||
- `small.en` model (~460MB) — better accuracy if user has RAM to spare
|
||||
- Models download lazily on first voice use; onboarding should gate voice on model availability
|
||||
|
||||
**Integration pattern:**
|
||||
|
||||
```typescript
|
||||
import { Whisper } from "smart-whisper";
|
||||
|
||||
const whisper = new Whisper("base.en"); // downloads on first call
|
||||
const transcript = await whisper.transcribe(audioBuffer, { language: "en" });
|
||||
```
|
||||
|
||||
**Server endpoint:** Add `POST /api/voice/transcribe` that accepts audio blob (WAV/WebM from browser MediaRecorder), returns transcript JSON. The existing v1.3 voice input uses browser-side Web Speech API as a fallback — this is the local/offline upgrade path.
|
||||
|
||||
**Confidence: MEDIUM** — Package verified on npm and GitHub. Version from GitHub releases page. Apple Silicon acceleration confirmed in README. No production deployment data for this specific version.
|
||||
|
||||
---
|
||||
|
||||
### 4. Piper TTS — Text to Speech (CPU-capable)
|
||||
|
||||
**Recommendation:** Spawn `piper` binary via `child_process`, do NOT use a Node.js wrapper library
|
||||
**Why:** No mature, production-ready Node.js binding for Piper TTS exists as of April 2026. The `@mintplex-labs/piper-tts-web` package is browser-only. ONNX-based implementations exist in Python (`piper-onnx`) and partially in JavaScript for Bun, but none are packaged for Node.js production use.
|
||||
|
||||
**Approach:**
|
||||
|
||||
```typescript
|
||||
import { spawn } from "child_process";
|
||||
import path from "path";
|
||||
|
||||
// piper binary downloaded to ~/.paperclip/voice/piper
|
||||
// voice model downloaded to ~/.paperclip/voice/models/
|
||||
async function synthesize(text: string, modelPath: string): Promise<Buffer> {
|
||||
function webmToWav16k(inputBuffer: Buffer): Promise<Buffer> {
|
||||
return new Promise((resolve, reject) => {
|
||||
const proc = spawn("piper", [
|
||||
"--model", modelPath,
|
||||
"--output-raw",
|
||||
const proc = spawn(ffmpegPath!, [
|
||||
"-i", "pipe:0", // read from stdin
|
||||
"-acodec", "pcm_s16le",
|
||||
"-ac", "1", // mono
|
||||
"-ar", "16000", // 16kHz
|
||||
"-f", "wav",
|
||||
"pipe:1", // write to stdout
|
||||
]);
|
||||
const chunks: Buffer[] = [];
|
||||
proc.stdout.on("data", (chunk) => chunks.push(chunk));
|
||||
proc.stdout.on("end", () => resolve(Buffer.concat(chunks)));
|
||||
proc.stdin.write(text);
|
||||
const out: Buffer[] = [];
|
||||
proc.stdout.on("data", (c: Buffer) => out.push(c));
|
||||
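// Settle on "close" so a non-zero ffmpeg exit rejects instead of resolving with partial output
proc.on("close", (code) => (code === 0 ? resolve(Buffer.concat(out)) : reject(new Error(`ffmpeg exited with code ${code}`))));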
|
||||
proc.stderr.on("data", () => {}); // suppress ffmpeg banner
|
||||
proc.on("error", reject);
|
||||
proc.stdin.write(inputBuffer);
|
||||
proc.stdin.end();
|
||||
});
|
||||
}
|
||||
```
|
||||
|
||||
**Alternative for pure-JS TTS (fallback/cloud):** The browser's `window.speechSynthesis` API covers the cloud and basic local cases without any server dependency. Use Web Speech API as the default TTS tier; offer Piper as an optional "high-quality offline voice" that the user must enable explicitly.
|
||||
|
||||
**Piper binary distribution:** During onboarding, detect if piper binary exists at `~/.paperclip/voice/piper`. If not, show download prompt. Use `https://github.com/rhasspy/piper/releases` to fetch the macOS arm64 binary. Store in `~/.paperclip/` (Nexus never renames this dir per PROJECT.md constraints).
|
||||
|
||||
**Recommended voice model for Mac Mini M4:** `en_US-lessac-medium` (~63MB) — good quality, fast on Apple Silicon.
|
||||
|
||||
**Confidence: MEDIUM** — Based on official Piper GitHub + community blog posts (Bun runtime example). Subprocess approach is the proven path. ONNX-native Node.js path is theoretically possible but no maintained package exists.
|
||||
|
||||
---
|
||||
|
||||
### 5. OAuth Flows — Google Gemini + OpenAI Free Tiers
|
||||
|
||||
**Recommendation:** `openid-client` v6
|
||||
**Version:** `^6.8.2` (latest stable, complete v6 API rewrite)
|
||||
**Where it lives:** `server/` — OAuth flows run server-side with PKCE
|
||||
|
||||
**Why openid-client over passport.js:**
|
||||
- Passport.js adds middleware abstraction that conflicts with Nexus's existing `better-auth` setup (already in `server/package.json`)
|
||||
- `openid-client` v6 is a certified OAuth 2/OIDC client that handles PKCE natively without middleware
|
||||
- Works alongside `better-auth` — openid-client handles the provider OAuth dance; better-auth handles the Nexus session
|
||||
|
||||
**What it provides:**
|
||||
- Authorization Code Flow with PKCE (required by OAuth 2.1)
|
||||
- Discovery via `.well-known/openid-configuration` — works for both Google and any OpenAI-compatible provider
|
||||
- Token refresh, revocation, introspection
|
||||
|
||||
**Integration pattern:**
|
||||
**b) Outgoing Telegram TTS path: WAV/PCM → OGG Opus (Telegram voice format)**
|
||||
|
||||
```typescript
|
||||
import * as client from "openid-client";
|
||||
|
||||
// Google discovery
|
||||
const googleConfig = await client.discovery(
|
||||
new URL("https://accounts.google.com"),
|
||||
process.env.GOOGLE_CLIENT_ID!,
|
||||
process.env.GOOGLE_CLIENT_SECRET!
|
||||
);
|
||||
|
||||
// Generate PKCE challenge
|
||||
const codeVerifier = client.randomPKCECodeVerifier();
|
||||
const codeChallenge = await client.calculatePKCECodeChallenge(codeVerifier);
|
||||
```
|
||||
|
||||
**Note on "zero sign-up":** Puter.js handles the zero-API-key tier. OAuth is the tier above that — where users already have Google/OpenAI accounts and want to connect them. Keep these separate in the onboarding UI: Puter tier requires zero setup; OAuth tier shows "Connect your Google account" CTA.
|
||||
|
||||
**Server routes to add:**
|
||||
- `GET /api/oauth/google/start` — initiate flow, return redirect URL
|
||||
- `GET /api/oauth/google/callback` — exchange code for tokens, store encrypted
|
||||
- Same pattern for OpenAI when their OAuth flow is stable
|
||||
|
||||
**Confidence: MEDIUM** — openid-client v6 verified via GitHub and npm. Google OIDC integration confirmed. OpenAI's free tier OAuth specifics are LOW confidence (their free tier structure changes frequently).
|
||||
|
||||
---
|
||||
|
||||
### 6. `npx buildthis` — CLI Bootstrapper
|
||||
|
||||
**No new library needed.** The package structure is a standard npm pattern.
|
||||
|
||||
**What to build:** A new npm package `buildthis` (or scoped `@nexus/buildthis`) published to npm. When run via `npx buildthis`, it:
|
||||
1. Detects if Nexus server is running locally (`localhost:4000` or configured port)
|
||||
2. If yes: opens browser to onboarding URL
|
||||
3. If no: guides user through one-command install (Docker or native)
|
||||
|
||||
**Package structure:**
|
||||
|
||||
```
|
||||
cli-bootstrapper/ # New top-level directory in the Nexus monorepo
|
||||
package.json # name: "buildthis", bin: { "buildthis": "./dist/index.js" }
|
||||
src/
|
||||
index.ts # #!/usr/bin/env node shebang entry
|
||||
dist/ # bundled by esbuild (same config as existing CLI)
|
||||
```
|
||||
|
||||
**`package.json` bin field:**
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "buildthis",
|
||||
"version": "0.1.0",
|
||||
"bin": {
|
||||
"buildthis": "./dist/index.js"
|
||||
},
|
||||
"files": ["dist"]
|
||||
function wavToOggOpus(inputBuffer: Buffer): Promise<Buffer> {
|
||||
return new Promise((resolve, reject) => {
|
||||
const proc = spawn(ffmpegPath!, [
|
||||
"-i", "pipe:0",
|
||||
"-c:a", "libopus",
|
||||
"-b:a", "32k",
|
||||
"-f", "ogg",
|
||||
"pipe:1",
|
||||
]);
|
||||
// ... same pattern as above
|
||||
});
|
||||
}
|
||||
```
|
||||
|
||||
**Key constraint:** Keep `buildthis` dependencies minimal. `npx` downloads and installs the package fresh on each invocation. Heavy dependencies (e.g. Commander.js, Inquirer) add 200-500ms to startup. Use Node.js built-ins (`readline`, `https`, `child_process`) wherever possible. Acceptable: `@clack/prompts` (already a project dependency, ~20KB).
|
||||
|
||||
**Existing CLI packages already use:** Commander.js `^13.1.0`, `@clack/prompts ^0.10.0`, `picocolors`. Reuse these — they're already in the project's lockfile.
|
||||
|
||||
**Confidence: HIGH** — npx bin-field pattern is official Node.js documentation. No novel library choices required.
|
||||
**Confidence: MEDIUM** — `ffmpeg-static` macOS arm64 confirmed via GitHub README. Pipe-based approach is well-documented. fluent-ffmpeg archival confirmed May 2025.
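
The pitfalls checklist for this milestone asks that the server log the ffmpeg version at startup and disable voice cleanly when the binary is absent. A minimal sketch of such a probe follows. The `checkFfmpeg` and `parseFfmpegVersion` names and the `FfmpegStatus` shape are assumptions for illustration, not existing project code.

```typescript
import { spawnSync } from "node:child_process";

interface FfmpegStatus {
  available: boolean;
  version: string | null;
}

// Pull the version token out of the "ffmpeg version X ..." banner line.
function parseFfmpegVersion(output: string): string | null {
  const match = /^ffmpeg version (\S+)/m.exec(output);
  return match ? (match[1] ?? null) : null;
}

// Hypothetical startup probe: run `<binary> -version` once and report the result.
// A missing binary surfaces as `result.error` (ENOENT) rather than a thrown exception.
function checkFfmpeg(binaryPath: string | null): FfmpegStatus {
  if (!binaryPath) return { available: false, version: null };
  const result = spawnSync(binaryPath, ["-version"], { encoding: "utf8" });
  if (result.error || result.status !== 0) {
    return { available: false, version: null };
  }
  return { available: true, version: parseFfmpegVersion(result.stdout) };
}
```

At startup the server could call `checkFfmpeg(ffmpegPath)` with the path exported by `ffmpeg-static`, log the version, and skip registering the voice routes when `available` is false.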
|
||||
|
||||
---
|
||||
|
||||
### 7. Persistent Memory — Personal AI Assistant
|
||||
### 3. Telegram Bridge
|
||||
|
||||
**Recommendation:** Two-layer approach — SQLite for structured memory + local vector search for semantic recall
|
||||
**Package:** `grammy`
|
||||
**Version:** `^1.41.1` (latest, supports Bot API 9.6)
|
||||
**Where it lives:** `server/` as an optional singleton service — only starts if `TELEGRAM_BOT_TOKEN` is set
|
||||
|
||||
**Layer 1 — Structured facts:** Use the existing LibSQL/Drizzle ORM stack. Add a `memories` table with columns: `id`, `user_id`, `content` (text), `embedding` (blob), `created_at`, `source` (`conversation` | `explicit`). No new DB library needed — LibSQL supports this schema.
|
||||
**Why grammy over alternatives:**
|
||||
- `grammy` has 1.4M weekly downloads vs `telegraf` at 900K — grammY is now the higher-adoption choice
|
||||
- grammY is TypeScript-first (clean types, no DefinitelyTyped package needed). Telegraf v4 migrated to TypeScript, but grammY's own comparison docs describe its type system as "too complex to understand"
|
||||
- `node-telegram-bot-api` is lower-level with no middleware, requires more boilerplate for this use case
|
||||
- grammY's file handling API (`ctx.getFile()`) is the cleanest for the voice relay use case
|
||||
|
||||
**Layer 2 — Semantic search:** `vectra`
|
||||
**Version:** `^0.12.3` (last published ~1 month ago)
|
||||
**Where it lives:** `server/` as an optional memory service
|
||||
|
||||
**Why vectra:**
|
||||
- Zero infrastructure — index is a folder of JSON files on disk. Fits `~/.paperclip/memory/` perfectly.
|
||||
- Sub-millisecond lookup for small corpora (<10K items, typical personal assistant use)
|
||||
- TypeScript-native, MIT licensed
|
||||
- No cloud dependency, no server process
|
||||
|
||||
**Embeddings for vectra:** Use Ollama's `nomic-embed-text` model (already in the Ollama ecosystem from v1.4). This avoids any OpenAI API key dependency for the memory layer.
|
||||
**What the bridge needs to do (thin relay only — per PROJECT.md):**
|
||||
|
||||
```typescript
|
||||
import { LocalIndex } from "vectra";
|
||||
import ollama from "ollama"; // already installed via hermes adapter
|
||||
import { Bot, Context } from "grammy";
|
||||
|
||||
const index = new LocalIndex(path.join(process.env.PAPERCLIP_HOME!, "memory"));
|
||||
const bot = new Bot(process.env.TELEGRAM_BOT_TOKEN!);
|
||||
|
||||
// Store memory
|
||||
const { embeddings } = await ollama.embeddings({ model: "nomic-embed-text", prompt: text });
|
||||
await index.insertItem({ vector: embeddings[0], metadata: { content: text, date: Date.now() } });
|
||||
// Relay text messages to Nexus chat API
|
||||
bot.on("message:text", async (ctx) => {
|
||||
const response = await relayToNexus(ctx.message.text, ctx.from.id);
|
||||
await ctx.reply(response);
|
||||
});
|
||||
|
||||
// Recall memories
|
||||
const results = await index.queryItems(queryEmbedding, 5);
|
||||
// Receive voice messages — download OGG, transcribe, relay
|
||||
bot.on("message:voice", async (ctx) => {
|
||||
const file = await ctx.getFile();
|
||||
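// ctx.getFile() returns file metadata (including file_path), not the bytes.
// downloadFile() is a local helper, not part of grammY core: it fetches file_path from the Telegram file endpoint and returns a Buffer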
|
||||
const oggBuffer = await downloadFile(file.file_path!, bot.token);
|
||||
const transcript = await transcribeOgg(oggBuffer); // via smart-whisper
|
||||
const response = await relayToNexus(transcript, ctx.from.id);
|
||||
await ctx.reply(response);
|
||||
});
|
||||
|
||||
// Run with long polling (no webhook needed for single-user local setup)
|
||||
bot.start();
|
||||
```
|
||||
|
||||
**Why NOT mem0ai:** `mem0ai` npm package defaults to OpenAI for both the LLM and embedder. Local/offline configuration is not documented in the Node SDK (only the Python SDK supports local providers). Using it would introduce an OpenAI API key hard dependency that conflicts with the "zero-config local-first" goal.
|
||||
**Voice message format from Telegram:** Telegram sends voice messages as OGG/Opus, 32kbps, mono, 48kHz. To pass this to Whisper (which needs 16kHz WAV), convert with `ffmpeg-static` pipeline: `ogg→wav16k`.
|
||||
|
||||
**Why NOT LangChain MemoryVectorStore:** LangChain JS is 40MB+ of dependencies and would be the largest single addition to the project. For a personal assistant's memory layer, vectra + Ollama embeddings is 1/20th the footprint.
|
||||
**To send TTS back to Telegram:** Convert Piper WAV output → OGG Opus via `ffmpeg-static`, then use `ctx.replyWithVoice(new InputFile(oggBuffer, "voice.ogg"))`.
|
||||
|
||||
**Confidence: MEDIUM** — vectra verified on npm/GitHub. Ollama embeddings confirmed via ollama.com docs. mem0ai limitation confirmed via their Node SDK docs (no local LLM option documented).
|
||||
**Long polling vs webhook:** Long polling is correct for this deployment (Mac Mini, local network, no public HTTPS endpoint required). No reverse proxy or SSL cert needed.
|
||||
|
||||
**Confidence: HIGH** — grammy official docs verified at grammy.dev. File download pattern confirmed via grammY file handling guide. Bot API 9.6 support confirmed in homepage badge.
|
||||
|
||||
---
|
||||
|
||||
### 4. Server-Side Piper TTS (Audio Response Endpoint)
|
||||
|
||||
**No new library needed.** The v1.5 STACK.md already specified the `child_process.spawn` approach with the Piper binary.
|
||||
|
||||
**What v1.6 adds on top of v1.5:**
|
||||
- A new Express route: `POST /api/voice/synthesize` that accepts `{ text, voice? }` and returns raw WAV audio (`Content-Type: audio/wav`)
|
||||
- This endpoint is used by both the web chat playback (browser `<audio>` element) and the Telegram bridge (convert WAV → OGG for `sendVoice`)
|
||||
- Voice mode flag: requests with `voiceMode: true` should receive a condensed plain-language response (no markdown, no code blocks) — this is a prompt instruction layer, not a library
|
||||
|
||||
**Response shape:**
|
||||
|
||||
```
|
||||
POST /api/voice/synthesize
|
||||
Body: { text: string, voice?: "en_US-lessac-medium" }
|
||||
Response: audio/wav binary stream
|
||||
```
|
||||
|
||||
**Confidence: HIGH** — this is an implementation pattern, not a new library.
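
Two pitfalls recorded elsewhere in this research bear on the text passed to `/api/voice/synthesize`: agent prefixes such as `[Nexus]` being read aloud verbatim, and Piper silently truncating input longer than about 500 characters. A minimal sketch of a pre-processing step follows. The function names, the exact characters stripped, and the sentence-splitting heuristic are all assumptions, not the project's implementation.

```typescript
const MAX_CHUNK_CHARS = 500; // limit taken from the truncation pitfall noted in this research

// Hypothetical helper: remove content that should not be spoken aloud.
function sanitizeForTts(text: string): string {
  return text
    .replace(/`{3}[\s\S]*?`{3}/g, " ") // drop fenced code blocks entirely
    .replace(/^\s*\[[^\]]+\]\s*/, "") // drop a leading agent prefix like "[Nexus] "
    .replace(/[*_`#>]/g, "") // drop common markdown punctuation (also hits snake_case)
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical helper: split on sentence boundaries, keeping every chunk <= maxChars.
function chunkForTts(text: string, maxChars = MAX_CHUNK_CHARS): string[] {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [];
  const chunks: string[] = [];
  let current = "";
  const flush = (): void => {
    const trimmed = current.trim();
    if (trimmed !== "") chunks.push(trimmed);
    current = "";
  };
  for (const sentence of sentences) {
    let rest = sentence;
    // A single sentence longer than the limit is hard-split.
    while (rest.length > maxChars) {
      flush();
      current = rest.slice(0, maxChars);
      flush();
      rest = rest.slice(maxChars);
    }
    if ((current + rest).length > maxChars) flush();
    current += rest;
  }
  flush();
  return chunks;
}
```

Each chunk would be synthesized in order and the resulting audio concatenated or played back to back. Splitting on sentence boundaries keeps pauses natural where possible.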
|
||||
|
||||
---
|
||||
|
||||
### 5. Audio Playback (Web Chat)
|
||||
|
||||
**No new library needed.** The browser's native `<audio>` element handles WAV and OGG playback. The existing `TtsButton` uses `new Audio(url)` already. The v1.6 enhancement is:
|
||||
|
||||
- Upgrade from `new Audio(url)` to a proper inline `<audio controls>` player, with an auto-play toggle stored in settings
|
||||
- Use `URL.createObjectURL(blob)` for streaming playback of TTS responses
|
||||
- Waveform visualization via `AnalyserNode` from the Web Audio API — no library needed
|
||||
|
||||
**Confidence: HIGH** — Web Audio API and `<audio>` are native browser APIs. No library required.
|
||||
|
||||
---
|
||||
|
||||
## Installation Summary
|
||||
|
||||
```bash
# server/ — add these dependencies
pnpm --filter @paperclipai/server add systeminformation openid-client vectra
# ui/ — add VAD for silence detection + auto-send
pnpm --filter @paperclipai/ui add @ricky0123/vad-react

# server/ — smart-whisper (optional, for local STT)
pnpm --filter @paperclipai/server add smart-whisper
# server/ — add FFmpeg binary (for audio format conversion) and Telegram bot
pnpm --filter @paperclipai/server add ffmpeg-static grammy

# ui/ — Puter.js frontend SDK
pnpm --filter @paperclipai/ui add @heyputer/puter.js

# New package for npx bootstrapper (separate publish)
# cli-bootstrapper/package.json — no new external deps beyond @clack/prompts
# server/ — types (ffmpeg-static ships its own types; grammy is TS-native)
# No @types/* needed for grammy
# ffmpeg-static types are included in the package
```
|
||||
|
||||
---
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
| Feature | Recommended | Alternative | Why Not |
|---------|-------------|-------------|---------|
| Hardware detection | `systeminformation ^5.31.5` | `detect-gpu` | Browser-only; Node.js usage not supported |
| Hardware detection | `systeminformation ^5.31.5` | `gpu-info` | Windows-only; no macOS/Linux support |
| STT | `smart-whisper ^0.8.1` | `nodejs-whisper ^0.2.9` | Subprocess-based, 10 months stale, slower on Apple Silicon |
| STT | `smart-whisper ^0.8.1` | Cloud Whisper API | Requires API key; breaks offline/local-first promise |
| TTS | Piper binary via `child_process` | `@mintplex-labs/piper-tts-web` | Browser-only npm package, cannot run in Node.js server |
| TTS | Piper binary | `sherpa-onnx ^1.12.34` | Supports both STT+TTS but adds 80MB binary; overkill if using smart-whisper for STT |
| OAuth | `openid-client ^6.8.2` | `passport-oauth2` | Adds middleware layer that conflicts with existing `better-auth` session handling |
| Memory | `vectra ^0.12.3` + Ollama embeddings | `mem0ai` | Node SDK requires OpenAI; no local embedding option documented |
| Memory | `vectra ^0.12.3` + Ollama embeddings | LangChain MemoryVectorStore | 40MB+ transitive dependency footprint; overkill for personal use scale |
| Zero-config cloud | `@heyputer/puter.js` | Direct provider SDKs | Would require managing API keys per user; Puter eliminates this entirely |
|
||||
|
||||
---
|
||||
|
||||
## What NOT to Add
|
||||
|
||||
| Avoid | Why | Use Instead |
|-------|-----|-------------|
| `passport.js` | Conflicts with existing `better-auth`; adds middleware overhead | `openid-client v6` (certified, no middleware) |
| `langchain` or `llamaindex` | 40-80MB dep footprint; overkill for single-user personal assistant | `vectra` + direct Ollama calls |
| `mem0ai` Node SDK | OpenAI hard dependency in Node SDK; no local embedding option | Custom memory layer: `vectra` + Ollama `nomic-embed-text` |
| `@mintplex-labs/piper-tts-web` | Browser-only, cannot be used in Node.js server | Piper binary subprocess |
| Any browser extension for auth | Security risk; not applicable to local app | Standard PKCE via `openid-client` |
| `electron` or `tauri` | PROJECT.md target is web app on Mac Mini, not desktop app | Existing Vite/Express architecture |
| `fluent-ffmpeg` | Archived May 2025, no longer maintained | Direct `child_process.spawn` with `ffmpeg-static` binary |
| `@ffmpeg-installer/ffmpeg` | Stale: last updated 2022, ships FFmpeg 4.x | `ffmpeg-static ^5.2.0` (ships FFmpeg 6.1.1, arm64 support) |
| `telegraf` | TypeScript type system "too complex to understand" per maintainers; lower weekly downloads than grammY | `grammy ^1.41.1` |
| `node-telegram-bot-api` | Low-level, requires callback polling setup, no middleware, more boilerplate | `grammy` |
| `@ricky0123/vad-node` | Node.js support was discontinued by the maintainer | `@ricky0123/vad-react` (browser-only, which is where recording lives) |
| `whisper.js` / `transformers.js` (browser WASM) | 200MB+ model download in browser; slow on first load; server-side Whisper via `smart-whisper` is already in place | `smart-whisper` on server (already in v1.5 stack) |
| `@mintplex-labs/piper-tts-web` for server TTS | Browser WASM only, no Node.js support | Piper binary via `child_process.spawn` (already specified in v1.5) |
| Wake word / real-time streaming audio | Out of scope per PROJECT.md | Future milestone |
|
||||
|
||||
---
|
||||
|
||||
## Version Compatibility Notes
|
||||
## Alternatives Considered
|
||||
|
||||
| Recommended | Alternative | When to Use Alternative |
|-------------|-------------|-------------------------|
| `grammy ^1.41.1` | `telegraf ^4.x` | If you need a battle-tested library with a larger plugin ecosystem and can tolerate complex TypeScript types |
| `ffmpeg-static` + `spawn` | `@ffmpeg/ffmpeg` (WASM) | If running in a serverless/edge environment where native binaries are not available (not applicable here) |
| `@ricky0123/vad-react` | Manual `AudioWorklet` energy threshold | If you need lower latency or don't want the 5MB ONNX WASM payload; simpler but less accurate silence detection |
| `@ricky0123/vad-react` | `MediaRecorder` with manual stop button (current impl) | The current v1.3 VoiceRecordButton works; VAD is strictly a UX upgrade |
|
||||
|
||||
---
|
||||
|
||||
## Version Compatibility
|
||||
|
||||
| Package | Compatible With | Notes |
|---------|-----------------|-------|
| `systeminformation ^5.31.5` | Node.js >=18 | v6 is being rewritten in TS but not released; stick with v5 |
| `smart-whisper ^0.8.1` | Node.js >=18, macOS arm64 | Prebuilt binaries for Apple Silicon — no compilation needed |
| `openid-client ^6.8.2` | Node.js >=20 | v6 is a full rewrite; do not use v5 patterns (completely different API) |
| `vectra ^0.12.3` | Node.js >=16 | File-based; no native addons, no compilation |
| `@heyputer/puter.js` | Browser (Vite/ESM) | Not for Node.js server use |
| `grammy ^1.41.1` | Node.js >=18, TypeScript >=5 | Ships its own types, no `@types/grammy` |
| `ffmpeg-static ^5.2.0` | Node.js >=14, macOS arm64 | Downloads correct binary at `npm install` time via `optionalDependencies` |
| `@ricky0123/vad-react ^0.0.36` | React 19, Vite 6 | React 19 peer dep fixed in August 2025; requires SharedArrayBuffer (COOP/COEP headers) for ONNX thread worker |
| `smart-whisper ^0.8.1` | Node.js >=18, macOS arm64 | From v1.5 — verify it's actually installed before v1.6 starts |
|
||||
|
||||
**Critical COOP/COEP note for `@ricky0123/vad-react`:** The Silero VAD model runs in an ONNX Runtime Web worker that requires `SharedArrayBuffer`. This means the server must send these headers on HTML responses:
|
||||
|
||||
```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```
|
||||
|
||||
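This is a small addition to the Express static file middleware: two response headers set on every HTML response. Without them, VAD silently fails in Chrome/Firefox. Because `require-corp` also blocks cross-origin subresources that do not opt in via CORS or `Cross-Origin-Resource-Policy`, any third-party assets should be rechecked after enabling it. The existing PWA service worker may also need `Cross-Origin-Embedder-Policy: require-corp` to avoid breaking.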
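A minimal sketch of the header middleware follows. It is written against a tiny local interface rather than Express types so it stays self-contained; the function name `crossOriginIsolation` is an assumption. Its `(req, res, next)` shape matches what Express middleware expects, so it could be registered with `app.use(crossOriginIsolation)` ahead of the static file handler.

```typescript
// Minimal structural type: anything with setHeader(), such as an Express response.
interface HeaderWritable {
  setHeader(name: string, value: string): unknown;
}

/** Sets the two headers that make SharedArrayBuffer available to the page. */
export function crossOriginIsolation(
  _req: unknown,
  res: HeaderWritable,
  next: () => void,
): void {
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  next();
}
```

In the browser, `window.crossOriginIsolated` reports whether the headers took effect, which is a quick check when VAD fails without an error.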
|
||||
|
||||
---
|
||||
|
||||
## Integration Architecture
|
||||
## Integration Architecture (v1.6 additions only)
|
||||
|
||||
```
Browser (UI)                          Server (Express)
─────────────────                     ────────────────────────────────
@heyputer/puter.js ──────────→        No server proxy needed
                                      (Puter calls go direct to puter.com)
Browser (UI)                          Server (Express)
─────────────────────────────────     ───────────────────────────────────────

React voice input ──────────→         POST /api/voice/transcribe
                                        └── smart-whisper (local STT)
                                        └── ~140MB model file in ~/.paperclip/voice/
@ricky0123/vad-react                  POST /api/transcribe (existing)
  └── useMicVAD ─────→                  └── ffmpeg-static: webm→wav16k
  └── onSpeechEnd(Float32Array)         └── smart-whisper: wav→text
  └── userSpeaking (waveform UI)        └── returns { text: string }

GET /api/system/hardware ←──── systeminformation
                                        └── GPU cores, total RAM, GPU model
React ChatInput (updated)             POST /api/voice/synthesize (new)
  └── voice mode toggle ─────→          └── Piper binary: text→wav
  └── auto-send on speech end           └── returns audio/wav stream
  └── <audio> inline player ←──────────────┘

React onboarding OAuth ────────→      GET /api/oauth/google/start
                                        └── openid-client PKCE flow
                                        └── GET /api/oauth/google/callback

Personal assistant chat ───────→      POST /api/assistant/chat
                                        └── vectra recall (nomic-embed-text via Ollama)
                                        └── context injection → selected AI provider

TTS response ──────────────────→      POST /api/voice/synthesize
                                        └── piper binary subprocess
                                        └── returns raw PCM → browser Audio API
                                      Telegram bridge (new, optional)
                                        └── grammy long polling
                                        └── message:text → relayToNexus()
                                        └── message:voice →
                                              ffmpeg: ogg→wav16k
                                              smart-whisper → text
                                              relayToNexus() → response
                                              Piper → wav
                                              ffmpeg: wav→ogg
                                              ctx.replyWithVoice()
```
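The `message:voice` branch in the diagram is a straight sequence of five steps, which can be sketched with injected dependencies so the same function is usable from a grammY handler and from tests. `VoiceDeps`, `handleVoiceNote`, and `createUpdateDeduper` are hypothetical names; only `update_id` (a field on every Telegram update) and `ctx.replyWithVoice()` come from the Telegram/grammY side.

```typescript
// Hypothetical dependency interface; each member stands in for a real service call.
export interface VoiceDeps {
  toWav16k(ogg: Uint8Array): Promise<Uint8Array>;   // ffmpeg: ogg -> wav 16 kHz mono
  transcribe(wav: Uint8Array): Promise<string>;     // Whisper STT
  relayToNexus(text: string): Promise<string>;      // existing chat pipeline
  synthesize(text: string): Promise<Uint8Array>;    // Piper TTS -> wav
  toOggOpus(wav: Uint8Array): Promise<Uint8Array>;  // ffmpeg: wav -> ogg
}

export interface VoiceReply {
  transcript: string;
  replyText: string;
  replyVoice: Uint8Array;
}

/** Runs the voice-note path in the order shown in the diagram. */
export async function handleVoiceNote(ogg: Uint8Array, deps: VoiceDeps): Promise<VoiceReply> {
  const wavIn = await deps.toWav16k(ogg);
  const transcript = await deps.transcribe(wavIn);
  const replyText = await deps.relayToNexus(transcript);
  const wavOut = await deps.synthesize(replyText);
  const replyVoice = await deps.toOggOpus(wavOut);
  return { transcript, replyText, replyVoice };
}

/** Remembers recent update ids so a re-delivered update is processed only once. */
export function createUpdateDeduper(capacity = 1000): (updateId: number) => boolean {
  const seen = new Set<number>();
  const order: number[] = [];
  return (updateId: number): boolean => {
    if (seen.has(updateId)) return true; // duplicate
    seen.add(updateId);
    order.push(updateId);
    if (order.length > capacity) seen.delete(order.shift() as number);
    return false;
  };
}
```

Keeping the pipeline free of grammY types means the bot handler only has to download the file, call `handleVoiceNote`, and pass `replyVoice` to `ctx.replyWithVoice()`.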
|
||||
|
||||
---
|
||||
|
||||
## Sources
|
||||
|
||||
- [Puter.js developer docs](https://developer.puter.com/) — API structure, user-pays model confirmed
- [Puter.js npm install](https://docs.puter.com/) — package name `@heyputer/puter.js` verified
- [systeminformation npm](https://www.npmjs.com/package/systeminformation) — v5.31.5 latest, v6 in progress
- [systeminformation GPU docs](https://systeminformation.io/graphics.html) — Apple Silicon GPU cores confirmed
- [smart-whisper GitHub releases](https://github.com/JacobLinCool/smart-whisper/releases) — v0.8.1, October 2025
- [openid-client npm](https://www.npmjs.com/package/openid-client) — v6.8.2, PKCE confirmed
- [openid-client v6 migration discussion](https://github.com/panva/openid-client/discussions/702) — API changes documented
- [vectra npm](https://www.npmjs.com/package/vectra) — v0.12.3, file-backed vector index
- [Ollama embedding models](https://ollama.com/blog/embedding-models) — nomic-embed-text capability confirmed
- [Piper TTS GitHub](https://github.com/rhasspy/piper) — macOS arm64 binary available
- [Running Piper TTS with JS (Bun)](https://n4ze3m.com/blog/running-piper-tts-with-javascript-in-the-bun-runtime) — ONNX approach documented
- [mem0 Node SDK docs](https://docs.mem0.ai/open-source/node-quickstart) — OpenAI default confirmed, no local option documented
- [clack/prompts npm](https://www.npmjs.com/package/@clack/prompts) — v1.2.0 latest (CLI already uses ^0.10.0)
- [npx bin field pattern](https://docs.npmjs.com/cli/v11/commands/npx/) — official npm docs
- [grammY official docs](https://grammy.dev/) — TypeScript support, long polling, file handling confirmed
- [grammY GitHub](https://github.com/grammyjs/grammY) — Bot API 9.6 badge, v1.41.1 version
- [grammY file handling guide](https://grammy.dev/guide/files) — `ctx.getFile()`, download pattern
- [grammY comparison with Telegraf](https://grammy.dev/resources/comparison) — TypeScript type quality comparison
- [ffmpeg-static GitHub](https://github.com/eugeneware/ffmpeg-static) — macOS arm64 binary confirmed, FFmpeg 6.1.1
- [fluent-ffmpeg archival](https://github.com/fluent-ffmpeg/node-fluent-ffmpeg) — archived May 22 2025, confirmed
- [@ricky0123/vad-react npm](https://www.npmjs.com/package/@ricky0123/vad-react) — v0.0.36, last published 3 months ago
- [vad React 19 support issue #188](https://github.com/ricky0123/vad/issues/188) — fixed August 28 2025, confirmed
- [vad API docs](https://docs.vad.ricky0123.com/user-guide/api/) — `onSpeechEnd` Float32Array 16kHz confirmed
- [Telegram Bot API sendVoice](https://core.telegram.org/bots/api#sendvoice) — OGG Opus format requirement
- [nodejs-whisper GitHub](https://github.com/ChetanXpro/nodejs-whisper) — v0.2.9 comparison (rejected: subprocess-based, 10 months stale)
- [Piper TTS GitHub releases](https://github.com/rhasspy/piper/releases) — macOS aarch64 binary availability
|
||||
|
||||
---
|
||||
|
||||
*Stack research for: Nexus v1.5 Smart Onboarding + Personal AI Assistant*
|
||||
*Researched: 2026-04-02*
|
||||
*Prior milestone stack research (fork maintenance): see the STACK.md entry dated 2026-03-30. This file was overwritten; the fork maintenance content is preserved in git history.*
|
||||
*Stack research for: Nexus v1.6 Voice Pipeline + Telegram Bridge*
|
||||
*Researched: 2026-04-03*
|
||||
*Supersedes: v1.5 STACK.md entries for smart-whisper and Piper — those remain valid; this file adds the glue and new libraries*
|
||||
|
|
|
|||
|
|
@ -1,17 +1,19 @@
|
|||
# Project Research Summary
|
||||
|
||||
**Project:** Nexus v1.5 — Smart Onboarding + Personal AI Assistant
|
||||
**Domain:** Forked open-source AI platform (Paperclip) — additive features on existing monorepo
|
||||
**Researched:** 2026-04-02
|
||||
**Project:** Nexus v1.6 — Voice Pipeline + Telegram Bridge
|
||||
**Domain:** Server-side STT/TTS voice pipeline with transport-agnostic service abstraction and a minimal Telegram relay bridge
|
||||
**Researched:** 2026-04-03
|
||||
**Confidence:** MEDIUM-HIGH
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
Nexus v1.5 adds a smart multi-step onboarding flow and Personal AI Assistant mode to an existing, working Paperclip fork. The product's primary value is removing every barrier between "first run" and "first useful AI interaction" — hardware detection drives model recommendations, Puter.js eliminates API key requirements for cloud AI, and persistent memory makes the assistant mode feel meaningfully different from a stateless chat interface. Experts building this type of product treat the onboarding funnel as the highest-risk surface: users who cannot configure a provider in the first two minutes abandon. The recommended approach is a tiered provider architecture (local Ollama → Puter.js zero-config cloud → Google OAuth → API key) that steers users toward local-first and uses Puter.js as the escape hatch for users who won't install Ollama, not as the default recommendation.
|
||||
Nexus v1.6 adds two parallel capability tracks onto an existing React/Express/Paperclip monorepo: a transport-agnostic voice pipeline (Whisper STT + Piper TTS) and a minimal Telegram bridge that reuses those pipeline primitives for phone access. The established expert pattern for this class of system is a shared service abstraction (`voicePipelineService`) that both the web HTTP layer and the Telegram bot call directly — never duplicating STT/TTS logic across transports. The Telegram bridge must be a thin relay only, forwarding messages to the existing `chatService` and returning the response, with no separate bot personality, no rich UI elements, and no per-user conversation branching beyond the existing single-workspace model.
|
||||
|
||||
The architecture is additive by design. Zero new database tables are introduced — all state lives in the existing `instance_settings.general` JSONB column or file-backed JSON in the server data directory. Four new server route sets mount into the existing Express app, and five new onboarding step components extend the existing NexusOnboardingWizard via Vite alias. This approach preserves upstream rebase safety, which remains the single most important constraint for a maintained fork. The most important technical decision is that Puter.js must be proxied through the server-side adapter system rather than called browser-direct, to preserve cost tracking, session management, and memory injection. This is not optional — browser-direct Puter.js is the primary anti-pattern and must be called out in every phase spec.
|
||||
The recommended approach is to build `voicePipelineService` first as the keystone service (`transcribe`, `synthesize`, `formatForVoice`), then wire the web voice UI improvements on top of it, then attach the Telegram bridge as a consumer of the same service. Audio format conversion via `ffmpeg-static` (not the archived `fluent-ffmpeg`) handles the two required transcoding paths: browser WebM/Opus to WAV 16kHz for Whisper, and Telegram OGG/Opus to WAV 16kHz for Whisper. The `@ricky0123/vad-react` library handles browser-side voice activity detection. `grammy ^1.41.1` handles the Telegram bot layer with long polling (correct for a local Mac Mini deployment without a public HTTPS endpoint).
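The two transcoding paths can share one small helper. The sketch below is an assumption-laden illustration: `wav16kArgs`, `oggOpusArgs`, and `transcode` are hypothetical names, `ffmpegPath` would be the absolute path exported by `ffmpeg-static`, and the Opus path assumes the bundled FFmpeg build includes `libopus`. The flags themselves (`-ar`, `-ac`, `-c:a`, `-f`, `pipe:0`, `pipe:1`) are standard FFmpeg options.

```typescript
import { spawn } from "node:child_process";

/** WebM/Opus or OGG/Opus on stdin -> 16 kHz mono 16-bit WAV on stdout (Whisper input). */
export function wav16kArgs(): string[] {
  return ["-hide_banner", "-loglevel", "error", "-i", "pipe:0",
          "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", "-f", "wav", "pipe:1"];
}

/** WAV on stdin -> OGG/Opus on stdout (format Telegram expects for voice notes). */
export function oggOpusArgs(): string[] {
  return ["-hide_banner", "-loglevel", "error", "-i", "pipe:0",
          "-c:a", "libopus", "-f", "ogg", "pipe:1"];
}

/** Pipes `input` through ffmpeg and resolves with everything written to stdout. */
export function transcode(ffmpegPath: string, args: string[], input: Buffer): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const child = spawn(ffmpegPath, args); // no shell, so arguments are not re-parsed
    const out: Buffer[] = [];
    let stderr = "";
    child.stdout.on("data", (chunk: Buffer) => out.push(chunk));
    child.stderr.on("data", (chunk: Buffer) => { stderr += chunk.toString(); });
    child.on("error", reject); // e.g. binary missing at ffmpegPath
    child.on("close", (code) => {
      if (code === 0) resolve(Buffer.concat(out));
      else reject(new Error(`ffmpeg exited with code ${code}: ${stderr.trim()}`));
    });
    child.stdin.on("error", () => { /* ffmpeg may close stdin early; exit code decides */ });
    child.stdin.end(input);
  });
}
```

A call such as `transcode(ffmpegPath, wav16kArgs(), webmBuffer)` covers both the browser and Telegram input paths, since FFmpeg detects the container from the stream.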
|
||||
|
||||
The top risks are: (1) Puter.js bypassing the Paperclip adapter machinery if implemented browser-direct, (2) OAuth token storage in localStorage creating security exposure and upstream key collisions, (3) persistent memory injecting sensitive data (credentials, API keys) into system prompts without sanitization, (4) hardware detection returning misleading values on Apple Silicon where unified memory is shared between CPU and GPU, and (5) the onboarding probe endpoint requiring board auth that does not exist yet on a fresh install. All five risks are avoidable with explicit architectural constraints established before implementation begins.
|
||||
The key risks are: (1) audio format mismatches causing silent transcription failures across browsers and the Telegram path, which require ffmpeg transcoding at every entry point; (2) the voice mode flag being stripped as it traverses the message pipeline layers, causing agents to respond with full markdown that TTS then renders as "asterisk asterisk important asterisk asterisk"; (3) Piper being invoked as a new process per request, causing 200–800ms model reload latency on every TTS response and silent truncation on responses over ~400 characters; and (4) browser autoplay policy blocking audio playback unless the `AudioContext` is unlocked during the user's initial "start voice mode" gesture.
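For risk (3), one mitigation is to split long responses on sentence boundaries before handing them to Piper, so no single synthesis call exceeds the length at which truncation was observed. This is a sketch only: `chunkForTts` is a hypothetical name and the 400-character default simply mirrors the figure quoted above.

```typescript
/**
 * Splits text into chunks of at most `maxChars`, preferring sentence boundaries.
 * A single sentence longer than `maxChars` is hard-split as a last resort.
 */
export function chunkForTts(text: string, maxChars = 400): string[] {
  // Split after ., ! or ? when followed by whitespace; no characters are dropped.
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = "";
  for (const raw of sentences) {
    const sentence = raw.trim();
    if (!sentence) continue;
    // Start a new chunk if appending this sentence would exceed the limit.
    if (current && current.length + 1 + sentence.length > maxChars) {
      chunks.push(current);
      current = "";
    }
    current = current ? `${current} ${sentence}` : sentence;
    while (current.length > maxChars) {
      chunks.push(current.slice(0, maxChars));
      current = current.slice(maxChars).trimStart();
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

The same chunks are a natural unit for sentence-buffered playback: the first chunk can be played while later ones are still being synthesized.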
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -19,139 +21,116 @@ The top risks are: (1) Puter.js bypassing the Paperclip adapter machinery if imp
|
|||
|
||||
### Recommended Stack
|
||||
|
||||
The existing stack (React 19, Vite 6, Tailwind v4, TanStack Query v5, LibSQL/Drizzle, Commander.js, better-auth) requires zero changes. All v1.5 additions are strictly additive. New server dependencies: `systeminformation ^5.31.5` (hardware detection), `openid-client ^6.8.2` (OAuth PKCE), `vectra ^0.12.3` (file-backed vector memory), `smart-whisper ^0.8.1` (local STT). New UI dependency: `@heyputer/puter.js` (zero-config cloud AI SDK, browser-only). Piper TTS uses no Node.js library — the `piper` binary is spawned via `child_process`. A new standalone package `packages/buildthis/` provides the `npx buildthis` CLI entry point with no additional external dependencies.
|
||||
|
||||
Key version constraints: `openid-client` v6 is a complete API rewrite and v5 patterns do not apply. `systeminformation` v5 is stable; v6 TypeScript rewrite is unreleased. `@heyputer/puter.js` is browser-only and must never be imported in server code. `smart-whisper ^0.8.1` has prebuilt macOS arm64 binaries, avoiding compilation on the Mac Mini M4 target. `vectra ^0.12.3` is file-backed with no native addons — no compilation required anywhere.
|
||||
v1.6 is additive to the v1.5 stack. The existing `smart-whisper`, `@mintplex-labs/piper-tts-web`, `multer`, and Express foundations remain unchanged. Three new libraries are required.
|
||||
|
||||
**Core technologies:**
|
||||
- `@heyputer/puter.js`: Zero-config cloud AI (UI auth only, never server) — user-pays model, 500+ models, zero developer billing
|
||||
- `systeminformation ^5.31.5`: Server-side GPU/RAM/Apple Silicon detection — only comprehensive cross-platform Node.js option with 20M+ monthly downloads
|
||||
- `smart-whisper ^0.8.1`: Local STT via whisper.cpp — native Apple Neural Engine acceleration on macOS arm64; prebuilt binaries available
|
||||
- Piper binary via `child_process`: Text-to-speech — no mature Node.js binding exists; subprocess is the proven production path
|
||||
- `openid-client ^6.8.2`: OAuth PKCE flows — certified, middleware-free, works alongside existing `better-auth`
|
||||
- `vectra ^0.12.3` + Ollama `nomic-embed-text`: File-backed semantic memory — zero infrastructure, MIT licensed, reuses existing Ollama ecosystem, avoids OpenAI dependency
|
||||
- `packages/buildthis/`: npx CLI bootstrapper — standard `bin` field pattern, uses only Node.js built-ins and existing `@clack/prompts`
|
||||
|
||||
**What NOT to add:** `passport.js` (conflicts with existing `better-auth`), LangChain/LlamaIndex (40-80MB footprint), `mem0ai` Node SDK (OpenAI hard dependency), `@mintplex-labs/piper-tts-web` on the server (browser-only).
|
||||
- `@ricky0123/vad-react ^0.0.36` (ui/) — Browser-side Silero VAD via ONNX Runtime Web; delivers `Float32Array` at 16kHz on speech end; React 19 peer dep confirmed fixed August 2025; requires COOP/COEP headers for `SharedArrayBuffer`
|
||||
- `ffmpeg-static ^5.2.0` (server/) — Ships FFmpeg 6.1.1 binaries including macOS arm64; invoked via `child_process.spawn`; do NOT use the archived `fluent-ffmpeg` (archived May 2025) or stale `@ffmpeg-installer/ffmpeg` (FFmpeg 4.x)
|
||||
- `grammy ^1.41.1` (server/) — TypeScript-native Telegram bot framework (1.4M weekly downloads, higher than Telegraf); long polling for local deployment; clean file handling API via `ctx.getFile()`; Bot API 9.6 support confirmed
|
||||
|
||||
No new library is required for server-side Piper TTS (existing `child_process.spawn` pattern from v1.5) or audio playback (native `<audio>` element + Web Audio API).
|
||||
|
||||
**Critical compatibility note:** `@ricky0123/vad-react` requires COOP/COEP HTTP headers on HTML responses for `SharedArrayBuffer` support. Without them, VAD silently fails in Chrome and Firefox. One-line addition to Express static file middleware.
|
||||
|
||||
### Expected Features
|
||||
|
||||
**Must have (v1.5 launch — P1):**
|
||||
- Mode selection (Personal AI / Project Builder / Both) — gates all assistant-specific features; minimum valid state for skip-all must be defined first
|
||||
- Hardware auto-detection + RAM/VRAM-aware model recommendation — primary UX claim; Apple Silicon requires special handling
|
||||
- Puter.js zero-config cloud tier — removes Ollama installation barrier; must be server-proxied
|
||||
- Personal AI Assistant chat with persistent memory — defines the mode as meaningfully different from stateless chat
|
||||
- Summary screen landing straight into chat — closes the onboarding funnel
|
||||
- Every step skippable — PROJECT.md requirement; skip-all must produce a working workspace with one agent
|
||||
- Piper TTS — completes the voice loop Whisper STT started in v1.3
|
||||
**Must have (table stakes — v1.6 launch):**
|
||||
|
||||
**Should have (v1.5.x — P2 differentiators):**
|
||||
- Project handoff ("turn this conversation into a project") — novel UX, no off-the-shelf solution; requires stable assistant mode first
|
||||
- MCP server connections (curated list, one-click add) — power user expectation; namespace all tool names to avoid Hermes skill collisions
|
||||
- Google OAuth cloud tier (Gemini without API key) — escape hatch when Puter.js limits surface; policy risk with third-party OAuth needs documentation
|
||||
- `npx buildthis` CLI entry point — zero-install UX; verify `npm search buildthis` for name collision before publishing
|
||||
- Silence-based auto-submit via `@ricky0123/vad-react` — users expect this; manual stop feels archaic
|
||||
- Waveform/amplitude visualization while recording — without it users cannot confirm mic is active
|
||||
- Voice response auto-play with toggle — users expect playback to be automatic unless disabled
|
||||
- Markdown-free voice responses — spoken markdown sounds broken; dual output (prose + full markdown) is the correct solution
|
||||
- Telegram text relay with agent prefix — core use case for phone access; format: `[AgentName]: response`
|
||||
- Telegram voice note transcription — mobile Telegram users default to voice notes; ignoring them immediately frustrates
|
||||
|
||||
**Should have (differentiators, add after validation):**
|
||||
|
||||
- Telegram TTS reply option (OGG voice note reply back) — add after text relay is validated
|
||||
- Sentence-buffered TTS streaming — start playing sentence 1 while sentence 2 synthesizes; reduces perceived latency
|
||||
|
||||
**Defer (v2+):**
|
||||
- OpenAI OAuth free tier — aggressive rate limits, unstable UX, LOW confidence on specifics
|
||||
- Cloud memory sync — GDPR scope, multi-device auth, enormous complexity for single-user product
|
||||
- Multi-MCP orchestration — enterprise complexity for personal tool
|
||||
- Streaming TTS word-by-word — browser Audio API complexity; sentence-buffered TTS is the practical optimum
|
||||
|
||||
**Provider tier ordering (steers users correctly):**
|
||||
Tier 0 (existing Hermes/Claude Code/OpenClaw) → Tier 1 (local Ollama, most private) → Tier 2 (Puter.js zero-config) → Tier 3 (Google OAuth) → Tier 4 (API key/subscription). Tiers 0 and 1 are the recommendations; Tier 2 is the fallback, not the default.
|
||||
- Real-time speech-to-speech — requires full-duplex WebSocket audio + Pipecat/LiveKit; entirely different architecture
|
||||
- Wake word detection — always-on mic; hardware device concern
|
||||
- Deep Telegram web chat session sync — requires Postgres pub/sub event bus; explicitly deferred per PROJECT.md
|
||||
- Per-agent Telegram bots — maintenance nightmare; single bot + agent prefix is the correct approach
|
||||
|
||||
### Architecture Approach
|
||||
|
||||
All v1.5 features hook into existing extension points without touching DB schema, API routes, or TypeScript identifiers from upstream. The NexusOnboardingWizard Vite alias continues as the sole onboarding replacement surface. File-backed JSON in `data/memory/<companyId>.json` handles assistant memory with no migration. Puter.js is proxied through a new `puterProxyService` that stores the auth token in `company_secrets` and pipes SSE output in the exact format the existing `useStreamingChat` hook already consumes. The four new server route sets (hardware, puter-proxy, voice, memory) are mounted in a single four-line addition to `app.ts`.
|
||||
The architecture is built around a single server-side `voicePipelineService` that both HTTP voice routes and the Telegram relay call directly, with no HTTP round-trip within the same process. The existing `chatService` and `puterProxyService` are consumed directly by the Telegram bridge as TypeScript function calls. `nexus-settings.json` (not DB) stores `voiceMode` enum and `telegramToken`. No DB schema changes are required.
|
||||
|
||||
**Major components:**
|
||||
1. `hardwareService` (NEW, server) — detects GPU/RAM/Apple Silicon via `systeminformation`; 5-min cache; returns `{ unifiedMemory: true, totalBytes }` for M-series chips; unauthenticated endpoint required for pre-auth onboarding
|
||||
2. `puterProxyService` (NEW, server) — stores Puter auth token in `company_secrets`, proxies AI calls as SSE matching existing chat pipeline format; Puter auth popup is UI-only
|
||||
3. `voiceService` (NEW, server) — manages `smart-whisper` for STT and Piper binary subprocess for TTS; graceful degradation if models not downloaded
|
||||
4. `memoryService` (NEW, server) — file-backed JSON memory per `companyId`; sanitization blocklist at write time; injects formatted memory block into system prompt
|
||||
5. `NexusOnboardingWizard.tsx` (MODIFIED, UI) — multi-step wizard consuming 5 new step components from `ui/src/components/onboarding/`
|
||||
6. `PersonalAssistantPage` (NEW, UI) — full-screen assistant experience; re-uses ChatPanel with `assistantMode` prop; lazy-loaded
|
||||
7. `packages/buildthis/` (NEW) — standalone npm package; health-check detects running Nexus; opens browser or guides install
|
||||
|
||||
**Build dependency order (from ARCHITECTURE.md):**
|
||||
Phase 1 (Hardware) → Phase 2 (Puter Proxy) → Phase 3 (Wizard Assembly) → Phase 4 (Memory + Assistant Mode) → Phase 5 (Voice) → Phase 6 (buildthis CLI)
|
||||
1. `voicePipelineService` (`server/src/services/voice-pipeline.ts`) — Transport-agnostic STT/TTS core; `transcribe(buffer, format)`, `synthesize(text, voiceId?)`, `formatForVoice(text)` — the keystone abstraction for v1.6
|
||||
2. `telegram service` (`server/src/services/telegram.ts`) — grammY bot lifecycle + thin relay; calls `voicePipelineService` and `chatService` directly; long polling; one persistent `sessionId` per Telegram `chatId`
|
||||
3. `voice.ts` route (`server/src/routes/voice.ts`) — HTTP wrappers for `POST /api/transcribe` (moved from `chat-files.ts`) and new `POST /api/synthesize`; keeps `chat-files.ts` close to upstream for clean rebases
|
||||
4. UI voice components (`VoiceMicButton`, `WaveformDisplay`, `VoiceModeToggle`, `useVoiceMode`, `useSilenceDetection`) — all new; enhance existing `ChatInput` without replacing `VoiceRecordButton`
|
||||
5. `nexus-settings` schema extension — adds `voiceMode: "text" | "voice_input" | "full_voice"` and optional `telegramToken`; no DB migration needed
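The settings extension in item 5 could be typed and parsed defensively along these lines. The field names and the three `voiceMode` values come from the list above; `parseVoiceSettings` and the fallback to `"text"` for missing or unknown values are assumptions.

```typescript
export const VOICE_MODES = ["text", "voice_input", "full_voice"] as const;
export type VoiceMode = (typeof VOICE_MODES)[number];

export interface NexusVoiceSettings {
  voiceMode: VoiceMode;
  telegramToken?: string;
}

/** Reads the voice-related fields from an already JSON-parsed settings object. */
export function parseVoiceSettings(raw: unknown): NexusVoiceSettings {
  const obj = (typeof raw === "object" && raw !== null ? raw : {}) as Record<string, unknown>;
  // Unknown or missing modes fall back to plain text (assumed default).
  const voiceMode = VOICE_MODES.includes(obj.voiceMode as VoiceMode)
    ? (obj.voiceMode as VoiceMode)
    : "text";
  const token =
    typeof obj.telegramToken === "string" && obj.telegramToken.trim() !== ""
      ? obj.telegramToken.trim()
      : undefined;
  // Omit the key entirely when no token is configured, so the bridge stays disabled.
  return token ? { voiceMode, telegramToken: token } : { voiceMode };
}
```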
|
||||
|
||||
**Key patterns to follow:**
|
||||
|
||||
- Move `/transcribe` out of `chat-files.ts` into `voice.ts` to reduce upstream rebase conflict surface
- Use `execFile` (not `exec`) for CLI subprocess calls — prevents shell injection, matches existing codebase pattern
- Store Telegram token in `nexus-settings.json`, not in DB — DB migrations conflict on rebase
- Long polling (`bot.start()`) not webhooks — Mac Mini is behind NAT with no public HTTPS endpoint
- Wrap all CLI calls (`piper`, `ffmpeg`) in `Promise.race([call, timeout(8000)])` for graceful degradation
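The `Promise.race` pattern in the last bullet can be wrapped in a small helper. This is a sketch under assumptions: `withTimeout` and `TimeoutError` are hypothetical names, and the helper only stops waiting. It does not terminate the underlying `piper` or `ffmpeg` process, which the caller would still need to kill.

```typescript
export class TimeoutError extends Error {}

/** Rejects with TimeoutError if `work` has not settled within `ms` milliseconds. */
export function withTimeout<T>(work: Promise<T>, ms: number, label = "operation"): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(`${label} timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so a fast result does not keep the event loop alive.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

A caller might use `withTimeout(synthesize(text), 8000, "piper")` and fall back to a text-only reply when the rejection is a `TimeoutError`.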
|
||||
|
||||
### Critical Pitfalls
|
||||
|
||||
1. **Puter.js browser-direct bypasses the adapter system** — cost tracking, session codec, and memory injection all break. `@heyputer/puter.js` in the UI is for the auth popup only; all AI calls go through `POST /api/puter-proxy/chat`. Recovery cost if shipped wrong: HIGH.
|
||||
1. **Audio format mismatch at every entry point** (Pitfall 27, 28) — Browser produces WebM/Opus. Telegram produces OGG/Opus 48kHz. Whisper requires WAV 16kHz mono. Always transcode via ffmpeg at every audio entry point with explicit `-ar 16000 -ac 1`. Make ffmpeg a hard startup dependency with absolute binary path, not PATH-resolved.
|
||||
|
||||
2. **OAuth tokens in localStorage** — XSS exposure; key collision with upstream Paperclip `localStorage` keys. All OAuth tokens (Google, Puter) must be stored server-side via existing `secretService`. Browser holds only a session indicator.
|
||||
2. **Voice mode flag stripped in message pipeline** (Pitfall 32) — The `voiceMode: true` flag on messages must survive every pipeline layer (client → Express → message persistence → agent session codec → Hermes adapter system prompt). If stripped at any layer, the agent responds in full markdown and TTS synthesizes spoken symbols. Audit every layer before building dual output on top of it.
|
||||
|
||||
3. **Persistent memory storing credentials** — regex-based blocklist (API key patterns, token patterns) must be applied at write time, not retrieval time. MCP tool results and user-pasted content both need the same sanitization. Recovery cost if shipped without: HIGH (requires retroactive purge).
|
||||
3. **Piper process-per-request anti-pattern** (Pitfall 29) — Spawning a new `piper` process per TTS request reloads the ONNX model each time (200–800ms overhead). Long responses (>400 chars) silently truncate. Sentence-chunk text before synthesis. Implement warmup call at server startup. Use absolute binary paths for service-mode deployment.
|
||||
|
||||
4. **Apple Silicon VRAM reporting** — M-series has unified memory; `os.totalmem()` is NOT GPU VRAM. Use `os.freemem()` as baseline, apply 0.75 multiplier, label all recommendations as "estimated." UI copy must say "unified memory" not "VRAM" for M-series chips.
|
||||
4. **Browser autoplay policy blocking TTS playback** (Pitfall 40) — `audio.play()` is blocked unless triggered by a user gesture. The "start voice mode" button click must unlock an `AudioContext` (`ctx.resume()`); subsequent programmatic playback via `AudioBufferSourceNode` works without further gestures. Developers with autoplay whitelisted in dev browsers never see this failure.
|
||||
|
||||
5. **Onboarding probe auth-gated on board auth** — hardware detection runs before board auth exists on a fresh install. A separate unauthenticated `GET /system/providers` endpoint is required. Without it, all provider probing silently returns 403 and auto-detection never works.
|
||||
5. **Telegram bot event loop blocking on voice pipeline** (Pitfall 37) — File download + ffmpeg transcode + Whisper transcription takes 2–5 seconds. If the handler awaits all of this synchronously, Telegram resends the update and the bot processes the same voice message multiple times. Acknowledge the update immediately, process async, send intermediate "Transcribing..." status to user.
|
||||
|
||||
6. **Vite alias silent divergence from upstream** — after every upstream rebase, diff `OnboardingWizard.tsx` against the prior upstream version and integrate any new props into `NexusOnboardingWizard.tsx`. Without this protocol, upstream wizard improvements are silently discarded.
|
||||
|
||||
7. **Piper TTS cold start hang** — WASM voice model downloads on first synthesis call (5–30 seconds), appearing as a broken feature. Pre-warm the model on a background thread during the onboarding voice step. Show download progress before enabling the toggle.
|
||||
|
||||
8. **Multi-provider creating competing defaults** — one primary provider per agent; do not let the wizard create PM and Engineer on different providers silently. Project Builder agents default to local/privacy-first; Personal AI assistant defaults to highest-quality available.
|
||||
6. **Piper/ffmpeg not found when running as system service** (Pitfall 38) — `spawn('piper', ...)` resolves via shell PATH in interactive terminals but not in `launchd`/`systemd` service environments. Store absolute binary paths in `nexus-settings` config; use them explicitly in every `spawn()` call.
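For the audio format pitfall (Pitfall 27), the explicit transcode could be expressed as a single argument builder shared by every audio entry point. This is a sketch under assumptions: `buildWhisperTranscodeArgs` is a hypothetical name, and file-based input and output are assumed here rather than the pipe-based invocation mentioned elsewhere in the research.

```typescript
// Hypothetical helper; the name and argument order are illustrative only.
// Builds ffmpeg arguments that turn any supported input into WAV 16 kHz mono.
export function buildWhisperTranscodeArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y",            // overwrite the output file without prompting
    "-i", inputPath, // WebM/Opus from the browser or OGG/Opus from Telegram
    "-ar", "16000",  // resample to 16 kHz
    "-ac", "1",      // downmix to mono
    "-f", "wav",     // force a WAV container
    outputPath,
  ];
}
```

The returned array is meant to be handed to `execFile` together with the absolute ffmpeg path stored in `nexus-settings`, so the same call behaves identically in a terminal and under `launchd`/`systemd` (Pitfall 38).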
|
||||
|
||||
---
|
||||
|
||||
## Implications for Roadmap
|
||||
|
||||
Based on the dependency graph in ARCHITECTURE.md and the pitfall-to-phase mapping in PITFALLS.md, the build order is fixed by component dependencies and upstream-conflict risk sequencing.
|
||||
Based on research, the component dependency graph strongly suggests a 4-phase structure:
|
||||
|
||||
### Phase 1: Hardware Detection + Mode Selection Foundation
|
||||
**Rationale:** All other phases depend on knowing the hardware tier and the user's chosen mode. Mode selection gates which features are surfaced. Hardware detection drives model recommendations. Critically, the unauthenticated probe endpoint (Pitfall 14) and the skip-all minimum valid state (Pitfall 22) must both be defined here as test cases before any provider probing or wizard step is built. This is the riskiest design phase even though it contains no upstream file modifications.
|
||||
**Delivers:** `hardwareService`, `GET /api/hardware/info`, unauthenticated `GET /system/providers`, `HardwareSummaryStep` and `ModeSelector` components, model recommendation lookup table with Apple Silicon handling, skip-all minimum valid state definition and test
|
||||
**Addresses:** Hardware auto-detection + model recommendation (P1), Mode selection UI (P1)
|
||||
**Avoids:** Pitfalls 13 (Apple Silicon VRAM), 14 (probe auth), 17 (competing defaults), 22 (skip-all breakage), 26 (stale model catalog fallback heuristic)
|
||||
### Phase 1: Voice Pipeline Foundation
|
||||
**Rationale:** `voicePipelineService` is the keystone — every other v1.6 feature calls it. Cannot build web voice UI improvements or the Telegram bridge without it. Schema extension for `voiceMode` also gates downstream work. Moving `/transcribe` to `voice.ts` reduces rebase friction before any other work begins.
|
||||
**Delivers:** `nexus-settings` schema with `voiceMode` + `telegramToken`; `voicePipelineService` with `transcribe`, `synthesize`, `formatForVoice`; `voice.ts` route with `/api/transcribe` (moved from `chat-files.ts`) and `/api/synthesize`; ffmpeg integration for WebM→WAV and OGG→WAV transcoding; `voiceMode` flag on `createMessageSchema` and `ChatMessage` shared type
|
||||
**Addresses:** Transport-agnostic pipeline (differentiator unlocking all features), voice mode flag storage (required by all consumers), server-side synthesize endpoint (required by Telegram bridge)
|
||||
**Avoids:** Pitfall 27 (audio format mismatch), Pitfall 32 (voice flag propagation path established before consumers built), Pitfall 38 (absolute binary paths baked in from the start), Pitfall 29 (sentence-chunked synthesis from the start)
|
||||
**Research flag:** Standard patterns — `execFile`, WAV format conversion, service abstraction are well-documented. Skip `/gsd:research-phase`.
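The sentence-chunked synthesis that Phase 1 bakes in (Pitfall 29) might look like the sketch below. `chunkForSynthesis` is a hypothetical name, and the 400-character default is taken from the truncation threshold quoted in the pitfall, not from a measured Piper limit.

```typescript
// Hypothetical helper: splits text on sentence boundaries and greedily packs
// sentences into chunks of at most `maxChars` characters.
export function chunkForSynthesis(text: string, maxChars = 400): string[] {
  // Split after ., ! or ? only when followed by whitespace, so "3.5" stays intact.
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = "";
  for (const raw of sentences) {
    const sentence = raw.trim();
    if (sentence === "") continue;
    const candidate = current === "" ? sentence : `${current} ${sentence}`;
    if (candidate.length <= maxChars) {
      current = candidate;
    } else {
      if (current !== "") chunks.push(current);
      // A single sentence longer than maxChars is kept whole rather than cut mid-word.
      current = sentence;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

Each chunk would then be synthesized in order, which keeps individual `piper` calls short and lets playback start before the whole response has been processed.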
|
||||
|
||||
### Phase 2: Puter.js Zero-Config Cloud Tier
|
||||
**Rationale:** Puter.js is the primary escape hatch for users who won't install Ollama. The server-proxy pattern must be established before the UI provider step is built — implementing UI first creates the risk of accidentally wiring browser-direct calls. The Puter auth popup is the one legitimate browser-side use; everything else is server-mediated.
|
||||
**Delivers:** `puterProxyService`, `POST /api/puter-proxy/chat` (SSE relay), `POST /api/puter-proxy/auth`, Puter section of `ProviderTierStep` (UI), Puter auth popup, Puter token storage via `secretService`
|
||||
**Uses:** `@heyputer/puter.js` (UI popup only), server-side HTTP calls to Puter API
|
||||
**Avoids:** Pitfall 15 (Puter.js bypassing adapter system), Pitfall 16 (OAuth tokens in localStorage)
|
||||
### Phase 2: Web Chat Voice UI
|
||||
**Rationale:** UI improvements depend only on Phase 1 pipeline and are independent of Telegram. Establishes the voice UX foundation that users interact with directly. Validates the voice mode flag end-to-end before Telegram consumes the same flag.
|
||||
**Delivers:** `VoiceMicButton` with `@ricky0123/vad-react` silence detection; `WaveformDisplay` via AnalyserNode; `VoiceModeToggle` three-state control; `useVoiceMode` and `useSilenceDetection` hooks; `ChatMessage` dual output (voice badge + expandable full markdown); `TtsButton` auto-play prop; COOP/COEP headers on Express static middleware
|
||||
**Addresses:** Silence auto-submit (table stakes), waveform visualization (table stakes), auto-play toggle (table stakes), voice mode setting (table stakes), markdown-free voice responses (table stakes)
|
||||
**Avoids:** Pitfall 31 (VAD library vs. naive RMS threshold), Pitfall 40 (AudioContext unlocked on voice mode start button), Pitfall 35 (sanitizeForTTS utility exists before first TTS integration test)
|
||||
**Research flag:** `@ricky0123/vad-react` API is confirmed via docs; COOP/COEP header pattern is standard Express middleware. Skip `/gsd:research-phase`.
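The `sanitizeForTTS` utility named under Pitfall 35 is not specified in the research, so the following is only an illustrative sketch of a post-processing strip. The set of Markdown constructs it handles is an assumption, and a production version would need to cover tables, HTML, and nested emphasis.

```typescript
// Illustrative sketch: removes common Markdown syntax so the TTS engine does
// not read symbols aloud. Not a full Markdown parser.
export function sanitizeForTTS(markdown: string): string {
  return markdown
    .replace(/```[\s\S]*?```/g, " ")          // drop fenced code blocks entirely
    .replace(/`([^`]*)`/g, "$1")              // inline code -> plain text
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, "$1") // images -> alt text
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1")  // links -> link text
    .replace(/^\s{0,3}#{1,6}\s+/gm, "")       // heading markers
    .replace(/^\s*[-*+]\s+/gm, "")            // bullet markers
    .replace(/\*\*|__|~~|\*/g, "")            // emphasis and strikethrough markers
    .replace(/\s+/g, " ")                     // collapse whitespace and newlines
    .trim();
}
```

Running the same function on every response, including those where the agent already honored `voiceMode`, keeps the fallback cheap and makes the output predictable when a smaller model ignores the voice prompt.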
|
||||
|
||||
### Phase 3: Multi-Step Onboarding Wizard Assembly
|
||||
**Rationale:** After hardware detection and Puter.js are independently built and tested, the wizard is assembled. This is the phase that modifies `NexusOnboardingWizard.tsx` substantially — establish the post-rebase diff protocol before touching this file. The ProviderTierStep covers all provider tiers (local, Puter, OAuth). VoiceSetupStep UI shell is included here; voice service is wired in Phase 5.
|
||||
**Delivers:** Refactored `NexusOnboardingWizard.tsx` (multi-step), `OnboardingSummaryStep`, `VoiceSetupStep` (shell only), OAuth PKCE popup pattern for Google Gemini, `instance_settings.general.nexus` config write, navigation routing to PersonalAssistantPage vs Dashboard
|
||||
**Implements:** Onboarding Wizard data flow from ARCHITECTURE.md
|
||||
**Avoids:** Pitfall 12 (Vite alias divergence — diff protocol in place), Pitfall 22 (skip-all confirmed from Phase 1), Pitfall 17 (one primary provider per mode)
|
||||
### Phase 3: Telegram Bridge
|
||||
**Rationale:** Telegram bridge is a pure consumer of Phase 1's `voicePipelineService` and the existing `chatService`. No web UI changes needed. Must follow Phase 1 but is independent of Phase 2.
|
||||
**Delivers:** `telegramService` with grammY long polling; text relay to `chatService`; voice note relay (OGG download → ffmpeg transcode → transcribe → agent → text reply); persistent `chatId → sessionId` mapping; agent prefix on replies; `POST /api/telegram/token` and `GET /api/telegram/status` management routes
|
||||
**Addresses:** Telegram text relay (table stakes), Telegram voice note relay (table stakes), agent identity visible in Telegram replies (table stakes)
|
||||
**Avoids:** Pitfall 28 (OGG 48kHz → WAV 16kHz explicit transcode, not assumed), Pitfall 33 (persistent session per chatId, not per message), Pitfall 34 (long polling; delete any existing webhook first), Pitfall 37 (async pipeline; acknowledge immediately; send "Transcribing..." status)
|
||||
**Research flag:** Needs `/gsd:research-phase` for grammY session management (persistent `chatId → sessionId` mapping approach vs. grammY conversation plugin) and async update acknowledgement pattern before implementation.
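One possible shape for the persistent `chatId → sessionId` mapping (Pitfall 33) is sketched below, pending the research-phase decision flagged above. `ChatSessionStore` and its methods are hypothetical, and where the serialized form is written (for example alongside `nexus-settings.json`) is left open as an assumption.

```typescript
// Hypothetical store: one agent session per Telegram chat, reused across messages.
export class ChatSessionStore {
  private readonly sessions = new Map<number, string>();

  constructor(entries: Array<[number, string]> = []) {
    for (const [chatId, sessionId] of entries) this.sessions.set(chatId, sessionId);
  }

  // Returns the existing session for a chat, creating one only on first contact.
  getOrCreate(chatId: number, createSession: () => string): string {
    const existing = this.sessions.get(chatId);
    if (existing !== undefined) return existing;
    const created = createSession();
    this.sessions.set(chatId, created);
    return created;
  }

  // Serializes the mapping so it can be written to disk and survive a restart.
  serialize(): string {
    return JSON.stringify(Array.from(this.sessions.entries()));
  }

  static deserialize(json: string): ChatSessionStore {
    return new ChatSessionStore(JSON.parse(json) as Array<[number, string]>);
  }
}
```

In the bridge, `createSession` would wrap whatever call the existing `chatService` exposes for starting a conversation. Keeping the store free of grammY types leaves the mapping reusable if the disposable bridge is later replaced by another transport.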
|
||||
|
||||
### Phase 4: Persistent Memory + Personal Assistant Mode
|
||||
**Rationale:** Memory injection modifies the existing chat route — the highest-risk upstream file modification in the entire milestone. It comes after onboarding is validated so mode context is reliable before memory is scoped to it. Memory sanitization is built at write time into the schema (not patched post-launch). This phase also defines the conversation isolation strategy between assistant and project builder modes.
|
||||
**Delivers:** `memoryService` with write-time sanitization blocklist, `GET/POST/DELETE /api/companies/:id/memory`, memory injection in chat route (MODIFIED), `PersonalAssistantPage`, `AssistantMemoryBar`, `useAssistantMemory` hook, conversation isolation via agent-based filter
|
||||
**Avoids:** Pitfall 19 (credential injection via memory), Pitfall 23 (assistant/project builder context bleed)
|
||||
|
||||
### Phase 5: Voice (Whisper STT + Piper TTS)
|
||||
**Rationale:** Independent of Phase 4 but requires Phase 3 (onboarding wizard must exist to surface VoiceSetupStep). Piper pre-warming strategy must be designed before the TTS toggle is wired, not after. This phase is isolated enough to be deprioritized or built in parallel without blocking Phase 4 or 6.
|
||||
**Delivers:** `voiceService` (smart-whisper + Piper subprocess), `POST /api/voice/transcribe`, `POST /api/voice/speak`, `GET /api/voice/status`, VoiceSetupStep wired into onboarding wizard, `useVoiceInput` and `useVoiceSpeech` hooks, ChatInput mic button (MODIFIED — upstream file, low risk), Piper pre-warm background thread with download progress indicator
|
||||
**Avoids:** Pitfall 18 (Piper TTS cold start hang)
|
||||
|
||||
### Phase 6: npx buildthis CLI Bootstrapper
|
||||
**Rationale:** Fully independent of all other phases. Can be built in parallel or deferred to v1.5.x. P2 priority — useful for sharing Nexus but not required for core assistant functionality. Primary gate is verifying `npm search buildthis` for package name collision before publishing.
|
||||
**Delivers:** `packages/buildthis/` standalone package, `bin.buildthis` entry point, health-check detection of running Nexus (`GET localhost:4000/api/health`), npm publish configuration
|
||||
**Avoids:** Pitfall 21 (npx package name collision)
|
||||
### Phase 4: Polish and Post-Launch Additions
|
||||
**Rationale:** After core voice and Telegram are validated, add differentiator features that require voice pipeline stability. These are explicitly post-validation based on user feedback triggers.
|
||||
**Delivers:** Telegram TTS reply (synthesize OGG voice note reply); sentence-buffered TTS streaming; Piper persistent warmup optimization; voice response history in chat UI
|
||||
**Addresses:** Sentence-buffered TTS (differentiator), Telegram TTS reply (differentiator)
|
||||
**Avoids:** Pitfall 39 (dual output via single LLM call, not two calls), Pitfall 29 (persistent Piper process architecture)
|
||||
**Research flag:** Flag for `/gsd:research-phase` on Piper persistent HTTP wrapper — community `piper-http` package status is unconfirmed; verify before committing to this approach.
|
||||
|
||||
### Phase Ordering Rationale
|
||||
|
||||
- Phase 1 must precede all others because mode and hardware are inputs to every subsequent phase's UX decisions. Skip-all state definition is a hard prerequisite for Phase 3.
|
||||
- Phase 2 (Puter.js proxy) precedes wizard assembly (Phase 3) because the server proxy pattern must exist before the UI references it — wiring UI first creates the anti-pattern risk.
|
||||
- Phase 4 (memory) is separated from Phase 3 (wizard) because the chat route modification is the highest upstream-conflict-risk step and deserves its own isolated phase after onboarding is stable and tested.
|
||||
- Phase 5 (voice) and Phase 6 (buildthis) are independent of each other and can be built in either order or in parallel.
|
||||
- Each phase delivers a working, rebasing-clean state — upstream sync can occur between any two phases without compound conflicts.
|
||||
|
||||
### Research Flags
|
||||
|
||||
Phases likely needing deeper research during planning:
|
||||
- **Phase 2 (Puter.js):** Puter rate limits and Node.js HTTP API behavior are not publicly documented. Need to verify server-side streaming API surface and token refresh behavior before designing the proxy service. Also: confirm Puter's terms of service allow server-side relaying of requests.
|
||||
- **Phase 4 (Memory):** The specific injection hook location in `server/src/services/chat.ts` needs codebase inspection to confirm the right insertion point. Also: decide between linear scan (v1.5) vs vectra vector search (v2) based on expected corpus size — should be explicit in the spec.
|
||||
- **Phase 5 (Voice):** `smart-whisper ^0.8.1` Apple Neural Engine acceleration claim needs verification on the actual Mac Mini M4 target before committing to `base.en` as the default model. If acceleration is not confirmed, fall back to `tiny.en`.
|
||||
|
||||
Phases with standard patterns (skip research-phase):
|
||||
- **Phase 1 (Hardware Detection):** `systeminformation` is mature (20M+ monthly downloads), Apple Silicon behavior is officially documented. Pattern is well-established across Ollama, LM Studio, llm-checker.
|
||||
- **Phase 3 (Wizard Assembly):** React multi-step wizard patterns are well-documented. NexusOnboardingWizard Vite alias pattern is already live in the codebase.
|
||||
- **Phase 6 (buildthis CLI):** Standard npm `bin` field pattern per official Node.js docs. No novel choices.
|
||||
- `voicePipelineService` (Phase 1) strictly precedes both Phase 2 and Phase 3 — this is the hardest dependency in the v1.6 graph
|
||||
- Phase 2 and Phase 3 are independent of each other and can run in parallel for two-developer teams; sequential ordering here assumes single-developer delivery
|
||||
- `voiceMode` schema change (Phase 1) must precede `ChatMessage` dual output (Phase 2) — shared package change gates UI work
|
||||
- Moving `/transcribe` from `chat-files.ts` to `voice.ts` in Phase 1 reduces rebase conflict surface before any other work begins
|
||||
- Phase 4 is explicitly post-validation — only add Telegram TTS reply and sentence-buffered streaming after confirming the basic pipeline is stable in real use
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -159,47 +138,48 @@ Phases with standard patterns (skip research-phase):
|
|||
|
||||
| Area | Confidence | Notes |
|
||||
|------|------------|-------|
|
||||
| Stack | HIGH | Most libraries verified via official docs; `systeminformation` and `openid-client` v6 fully confirmed; `smart-whisper` version from GitHub releases with no production deployment data |
|
||||
| Features | MEDIUM | Puter.js rate limits and production reliability unverified at scale; hardware detection patterns confirmed from Ollama/LM Studio ecosystem; UX recommendations inferred from Clerk/Vercel/Postman patterns |
|
||||
| Architecture | HIGH | Based on direct codebase inspection of `/opt/nexus/`; all extension points verified to exist; file-backed JSON approach confirmed feasible given single-user M4 Mini target |
|
||||
| Pitfalls | HIGH | Based on direct codebase analysis plus targeted research per integration domain; Apple Silicon VRAM behavior confirmed; Puter.js adapter risk confirmed from architecture analysis |
|
||||
| Stack | MEDIUM-HIGH | grammy HIGH (official docs, Bot API 9.6 verified); ffmpeg-static MEDIUM (arm64 confirmed, pipe approach verified); vad-react MEDIUM (React 19 fix confirmed via GitHub issue; ONNX WASM SharedArrayBuffer behavior requires COOP/COEP header testing) |
|
||||
| Features | MEDIUM-HIGH | STT/TTS pipeline patterns well-documented; dual output prompt engineering reliability is MEDIUM — smaller 7B models produce malformed structured output ~10% of the time; Approach B fallback (post-processing strip) must be implemented |
|
||||
| Architecture | HIGH | Based on direct codebase inspection of actual source files; service boundary and data flow verified; no speculative assumptions |
|
||||
| Pitfalls | HIGH | Based on direct codebase analysis plus targeted research on each integration domain; v1.6 pitfalls 27–40 are specific, sourced, and actionable |
|
||||
|
||||
**Overall confidence:** MEDIUM-HIGH
|
||||
|
||||
### Gaps to Address
|
||||
|
||||
- **Puter.js Node.js API surface:** Server-side streaming via HTTP (not the browser SDK) needs verification before `puterProxyService` is specced. Architecture assumes `stream: true` works server-side — confirm during Phase 2 planning.
|
||||
- **Puter.js rate limits and ToS:** "No restrictions" claim is unverified at scale. Design graceful degradation for rate limit responses. Attribute all costs to user's Puter account in UI copy.
|
||||
- **smart-whisper Apple Silicon acceleration:** Performance claim needs on-device verification on the Mac Mini M4 target. If not confirmed, `tiny.en` may be required as default instead of `base.en`.
|
||||
- **Google Gemini OAuth policy risk:** Using Gemini CLI OAuth with third-party apps may trigger abuse detection (GitHub issue #21866 confirmed). Gate this tier on users with active Gemini subscriptions; document limitation explicitly.
|
||||
- **Memory store performance ceiling:** Linear scan is acceptable for fewer than ~500 entries. Define the upgrade threshold to `vectra` vector search during Phase 4 planning and document it in the code.
|
||||
- **OpenAI OAuth free tier:** LOW confidence — OpenAI free tier OAuth specifics change frequently. Do not include in v1.5 scope; defer to v2+.
|
||||
- **grammY session management approach:** Lightweight in-memory `Map<chatId, sessionId>` vs. grammY conversation plugin — not evaluated. Validate during Phase 3 research-phase before implementation.
|
||||
- **Dual output prompt reliability on 7B models:** The dual-output prompt is reliable on larger models but yields well-formed output only about 90% of the time on the 7B tier. Approach B (the post-processing strip) must therefore be implemented as a required safety net, not an optional extra. Design both approaches before Phase 1 ships.
|
||||
- **Piper persistent process viability:** Sentence-chunked per-request synthesis avoids the worst of the reload latency, but a persistent Piper HTTP wrapper would be cleaner long-term. Community `piper-http` status unconfirmed. Flag for Phase 4 research-phase.
|
||||
- **smart-whisper OGG support:** Whether `smart-whisper` can ingest OGG directly (avoiding ffmpeg for the Telegram path) or always requires WAV was not confirmed. Verify at Phase 1 start — if OGG is accepted natively, the Telegram transcription path can skip one transcode step.
|
||||
|
||||
---
|
||||
|
||||
## Sources
|
||||
|
||||
### Primary (HIGH confidence)
|
||||
- `/opt/nexus/` direct codebase inspection (ARCHITECTURE.md) — extension points, existing patterns, upstream file risk
|
||||
- [Puter.js developer docs](https://developer.puter.com/) — user-pays model, API structure confirmed
|
||||
- [systeminformation official docs](https://systeminformation.io/graphics.html) — Apple Silicon GPU core detection confirmed
|
||||
- [openid-client npm v6](https://www.npmjs.com/package/openid-client) — PKCE, v6 API confirmed
|
||||
- [Piper TTS GitHub](https://github.com/rhasspy/piper) — macOS arm64 binary, CPU-capable, MIT license
|
||||
- [npx bin field pattern](https://docs.npmjs.com/cli/v11/commands/npx/) — official npm docs
|
||||
- [grammY official docs](https://grammy.dev/) — TypeScript support, long polling, file handling, Bot API 9.6 support
|
||||
- [grammY deployment types guide](https://grammy.dev/guide/deployment-types) — long polling vs. webhooks recommendation for local deployment
|
||||
- [ffmpeg-static GitHub](https://github.com/eugeneware/ffmpeg-static) — macOS arm64 binary confirmed, FFmpeg 6.1.1, pipe-based invocation pattern
|
||||
- [Telegram Bot API sendVoice](https://core.telegram.org/bots/api#sendvoice) — OGG Opus format requirement, 48kHz mono wire format
|
||||
- Direct codebase inspection: `server/src/routes/chat-files.ts`, `chat.ts`, `services/nexus-settings.ts`, `app.ts`, `ui/src/components/VoiceRecordButton.tsx`, `TtsButton.tsx`, `hooks/usePiperTts.ts`, `packages/shared/src/validators/chat.ts`, `packages/shared/src/types/chat.ts`
|
||||
- `.planning/STATE.md` — v1.6 architectural decisions (transport-agnostic, disposable bridge, dual output, per-message flag)
|
||||
|
||||
### Secondary (MEDIUM confidence)
|
||||
- [smart-whisper GitHub releases](https://github.com/JacobLinCool/smart-whisper/releases) — v0.8.1, Apple Silicon acceleration claim
|
||||
- [vectra npm](https://www.npmjs.com/package/vectra) — file-backed vector index, MIT license
|
||||
- [Ollama embedding models](https://ollama.com/blog/embedding-models) — nomic-embed-text capability confirmed
|
||||
- [mem0 Node SDK docs](https://docs.mem0.ai/open-source/node-quickstart) — OpenAI default confirmed, no local option documented
|
||||
- [Google Gemini free tier 2026](https://www.aifreeapi.com/en/posts/google-gemini-api-free-tier) — Gemini 2.0 Flash free tier
|
||||
- [Google Gemini OAuth via Opencode](https://syntackle.com/blog/google-gemini-ai-subscription-with-opencode/) — OAuth pattern confirmed in related tool
|
||||
- [@ricky0123/vad-react npm](https://www.npmjs.com/package/@ricky0123/vad-react) — v0.0.36, React 19 fix confirmed
|
||||
- [vad React 19 support issue #188](https://github.com/ricky0123/vad/issues/188) — React 19 peer dep fix confirmed August 2025
|
||||
- [vad API docs](https://docs.vad.ricky0123.com/user-guide/api/) — `onSpeechEnd` Float32Array 16kHz output confirmed
|
||||
- [fluent-ffmpeg archival](https://github.com/fluent-ffmpeg/node-fluent-ffmpeg) — archived May 22 2025, confirmed
|
||||
- [Real-Time vs Turn-Based STT/TTS Voice Agent Architecture (softcery.com)](https://softcery.com/lab/ai-voice-agents-real-time-vs-turn-based-tts-stt-architecture)
|
||||
- [The Voice AI Stack for Building Agents (assemblyai.com)](https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents)
|
||||
- [Telegram speech-to-text bot with Node.js (loonskai.com)](https://www.loonskai.com/blog/telegram-speech-to-text-bot-with-nodejs)
|
||||
- [grammY file handling guide](https://grammy.dev/guide/files) — `ctx.getFile()`, download pattern
|
||||
|
||||
### Tertiary (LOW confidence)
|
||||
- [Running Piper TTS with JS (Bun)](https://n4ze3m.com/blog/running-piper-tts-with-javascript-in-the-bun-runtime) — subprocess approach validated in Bun; no Node.js production data
|
||||
- [Google Gemini OAuth policy risk](https://github.com/google-gemini/gemini-cli/issues/21866) — third-party OAuth may trigger abuse detection; single GitHub issue
|
||||
- Puter.js rate limits — "no restrictions" from Puter marketing only; no independent verification
|
||||
### Tertiary (LOW confidence — inferred from patterns)
|
||||
- Dual output prompt reliability on 7B models — inferred from structured output community reports; not benchmarked on Hermes specifically
|
||||
- Piper persistent HTTP wrapper — community pattern referenced; `piper-http` package status not verified
|
||||
- `sanitizeForTTS` utility pattern — inferred from TTS pipeline implementations; implementation detail not sourced from a canonical reference
|
||||
|
||||
---
|
||||
*Research completed: 2026-04-02*
|
||||
|
||||
*Research completed: 2026-04-03*
|
||||
*Ready for roadmap: yes*
|
||||
|
|
|
|||