nexus/.planning/research/SUMMARY.md
2026-04-03 23:53:14 +00:00

# Project Research Summary
**Project:** Nexus v1.6 — Voice Pipeline + Telegram Bridge
**Domain:** Server-side STT/TTS voice pipeline with transport-agnostic service abstraction and a minimal Telegram relay bridge
**Researched:** 2026-04-03
**Confidence:** MEDIUM-HIGH
---
## Executive Summary
Nexus v1.6 adds two parallel capability tracks onto an existing React/Express/Paperclip monorepo: a transport-agnostic voice pipeline (Whisper STT + Piper TTS) and a minimal Telegram bridge that reuses those pipeline primitives for phone access. The established expert pattern for this class of system is a shared service abstraction (`voicePipelineService`) that both the web HTTP layer and the Telegram bot call directly — never duplicating STT/TTS logic across transports. The Telegram bridge must be a thin relay only, forwarding messages to the existing `chatService` and returning the response, with no separate bot personality, no rich UI elements, and no per-user conversation branching beyond the existing single-workspace model.
The recommended approach is to build `voicePipelineService` first as the keystone service (`transcribe`, `synthesize`, `formatForVoice`), then wire the web voice UI improvements on top of it, then attach the Telegram bridge as a consumer of the same service. Audio format conversion via `ffmpeg-static` (not the archived `fluent-ffmpeg`) handles the two required transcoding paths: browser WebM/Opus to WAV 16kHz for Whisper, and Telegram OGG/Opus to WAV 16kHz for Whisper. The `@ricky0123/vad-react` library handles browser-side voice activity detection. `grammy ^1.41.1` handles the Telegram bot layer with long polling (correct for a local Mac Mini deployment without a public HTTPS endpoint).
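
The two transcoding paths differ only in the input container, so they can share one argument builder. The following is a minimal sketch, assuming ffmpeg reads the upload from stdin and writes WAV to stdout; the helper name `buildWhisperTranscodeArgs` and the `InputFormat` type are hypothetical and not part of the codebase.

```typescript
// Hypothetical helper: builds ffmpeg CLI arguments that turn a browser
// (WebM/Opus) or Telegram (OGG/Opus) recording into 16 kHz mono WAV.
type InputFormat = "webm" | "ogg";

function buildWhisperTranscodeArgs(format: InputFormat): string[] {
  return [
    "-hide_banner",
    "-loglevel", "error",
    "-f", format,         // declare the container, since stdin has no file extension
    "-i", "pipe:0",       // read the uploaded audio from stdin
    "-ar", "16000",       // resample to 16 kHz
    "-ac", "1",           // downmix to mono
    "-c:a", "pcm_s16le",  // 16-bit PCM samples
    "-f", "wav",
    "pipe:1",             // write the WAV to stdout
  ];
}
```

The array would be passed to `spawn(ffmpegPath, args)`, where `ffmpegPath` is the binary path exported by `ffmpeg-static`. When writing WAV to a pipe, ffmpeg cannot seek back to fill in the header's length fields, so a temp file may be the safer output target if the decoder turns out to be strict about them.
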
The key risks are: (1) audio format mismatches causing silent transcription failures across browsers and the Telegram path, which require ffmpeg transcoding at every entry point; (2) the voice mode flag being stripped as it traverses the message pipeline layers, causing agents to respond with full markdown that TTS then renders as "asterisk asterisk important asterisk asterisk"; (3) Piper being invoked as a new process per request, causing 200–800ms model reload latency on every TTS response and silent truncation on responses over ~400 characters; and (4) browser autoplay policy blocking audio playback unless the `AudioContext` is unlocked during the user's initial "start voice mode" gesture.
---
## Key Findings
### Recommended Stack
v1.6 is additive to the v1.5 stack. The existing `smart-whisper`, `@mintplex-labs/piper-tts-web`, `multer`, and Express foundations remain unchanged. Three new libraries are required.
**Core technologies:**
- `@ricky0123/vad-react ^0.0.36` (ui/) — Browser-side Silero VAD via ONNX Runtime Web; delivers `Float32Array` at 16kHz on speech end; React 19 peer dep confirmed fixed August 2025; requires COOP/COEP headers for `SharedArrayBuffer`
- `ffmpeg-static ^5.2.0` (server/) — Ships FFmpeg 6.1.1 binaries including macOS arm64; invoked via `child_process.spawn`; do NOT use the archived `fluent-ffmpeg` (archived May 2025) or stale `@ffmpeg-installer/ffmpeg` (FFmpeg 4.x)
- `grammy ^1.41.1` (server/) — TypeScript-native Telegram bot framework (1.4M weekly downloads, higher than Telegraf); long polling for local deployment; clean file handling API via `ctx.getFile()`; Bot API 9.6 support confirmed
No new library is required for server-side Piper TTS (existing `child_process.spawn` pattern from v1.5) or audio playback (native `<audio>` element + Web Audio API).
**Critical compatibility note:** `@ricky0123/vad-react` requires COOP/COEP HTTP headers on HTML responses for `SharedArrayBuffer` support. Without them, VAD silently fails in Chrome and Firefox. The fix is a small addition to the Express static file middleware that sets both headers.
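
A minimal sketch of the header middleware, written against a structural type so it does not depend on Express typings. The function name `crossOriginIsolation` is hypothetical; the two header names and values are the standard ones that enable cross-origin isolation.

```typescript
// Minimal structural type so the sketch compiles without Express installed.
interface HeaderWritable {
  setHeader(name: string, value: string): void;
}

// Express-compatible middleware shape: (req, res, next).
function crossOriginIsolation(
  _req: unknown,
  res: HeaderWritable,
  next: () => void,
): void {
  // Both headers are needed before the browser exposes SharedArrayBuffer.
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  next();
}
```

It would be registered ahead of the static handler, for example `app.use(crossOriginIsolation)` before `express.static(...)`. Note that `require-corp` also blocks cross-origin subresources that do not opt in, so any third-party assets loaded by the UI should be checked after enabling it.
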
### Expected Features
**Must have (table stakes — v1.6 launch):**
- Silence-based auto-submit via `@ricky0123/vad-react` — users expect this; manual stop feels archaic
- Waveform/amplitude visualization while recording — without it users cannot confirm mic is active
- Voice response auto-play with toggle — users expect playback to be automatic unless disabled
- Markdown-free voice responses — spoken markdown sounds broken; dual output (prose + full markdown) is the correct solution
- Telegram text relay with agent prefix — core use case for phone access; format: `[AgentName]: response`
- Telegram voice note transcription — mobile Telegram users default to voice notes; ignoring them immediately frustrates
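
For the markdown-free voice response requirement in the list above, a post-processing strip is one possible safety net. The sketch below is a regex heuristic, not a full markdown parser; `stripMarkdownForSpeech` is a hypothetical name, and a function like it could sit behind `formatForVoice`. Underscore emphasis is deliberately left alone so that snake_case identifiers are not mangled.

```typescript
// Heuristic markdown-to-speech cleanup. Hypothetical helper, assumptions noted inline.
function stripMarkdownForSpeech(markdown: string): string {
  return markdown
    .replace(/`{3}[\s\S]*?`{3}/g, " ")          // drop fenced code blocks entirely
    .replace(/`([^`]*)`/g, "$1")                // inline code: keep the text
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, "$1")   // images: keep the alt text
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1")    // links: keep the link text
    .replace(/^\s{0,3}#{1,6}\s+/gm, "")         // heading markers
    .replace(/^\s*>\s?/gm, "")                  // blockquote markers
    .replace(/^\s*(?:[-*+]|\d+\.)\s+/gm, "")    // list bullets and numbers
    .replace(/\*\*(.+?)\*\*/g, "$1")            // bold (asterisk form only)
    .replace(/\*(.+?)\*/g, "$1")                // italic (asterisk form only)
    .replace(/\s+/g, " ")                       // collapse whitespace and newlines
    .trim();
}
```

Because it is a heuristic, text that legitimately contains asterisks (for example arithmetic like `2 * 3 * 4`) can lose them; that trade-off seems acceptable for spoken output but should be confirmed against real agent responses.
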
**Should have (differentiators, add after validation):**
- Telegram TTS reply option (OGG voice note reply back) — add after text relay is validated
- Sentence-buffered TTS streaming — start playing sentence 1 while sentence 2 synthesizes; reduces perceived latency
**Defer (v2+):**
- Real-time speech-to-speech — requires full-duplex WebSocket audio + Pipecat/LiveKit; entirely different architecture
- Wake word detection — always-on mic; hardware device concern
- Deep Telegram web chat session sync — requires Postgres pub/sub event bus; explicitly deferred per PROJECT.md
- Per-agent Telegram bots — maintenance nightmare; single bot + agent prefix is the correct approach
### Architecture Approach
The architecture is built around a single server-side `voicePipelineService` that both HTTP voice routes and the Telegram relay call directly, with no HTTP round-trip within the same process. The existing `chatService` and `puterProxyService` are consumed directly by the Telegram bridge as TypeScript function calls. `nexus-settings.json` (not DB) stores `voiceMode` enum and `telegramToken`. No DB schema changes are required.
**Major components:**
1. `voicePipelineService` (`server/src/services/voice-pipeline.ts`) — Transport-agnostic STT/TTS core; `transcribe(buffer, format)`, `synthesize(text, voiceId?)`, `formatForVoice(text)` — the keystone abstraction for v1.6
2. `telegramService` (`server/src/services/telegram.ts`) — grammY bot lifecycle + thin relay; calls `voicePipelineService` and `chatService` directly; long polling; one persistent `sessionId` per Telegram `chatId`
3. `voice.ts` route (`server/src/routes/voice.ts`) — HTTP wrappers for `POST /api/transcribe` (moved from `chat-files.ts`) and new `POST /api/synthesize`; keeps `chat-files.ts` close to upstream for clean rebases
4. UI voice components (`VoiceMicButton`, `WaveformDisplay`, `VoiceModeToggle`, `useVoiceMode`, `useSilenceDetection`) — all new; enhance existing `ChatInput` without replacing `VoiceRecordButton`
5. `nexus-settings` schema extension — adds `voiceMode: "text" | "voice_input" | "full_voice"` and optional `telegramToken`; no DB migration needed
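
The settings extension in item 5 can be pictured as a small type plus a defensive parser. This is a sketch in plain TypeScript rather than the project's actual validator; `VoiceSettings`, `parseVoiceSettings`, and the fallback to `"text"` for missing or unknown values are assumptions made for illustration.

```typescript
type VoiceMode = "text" | "voice_input" | "full_voice";

// Hypothetical shape of the two new settings fields.
interface VoiceSettings {
  voiceMode: VoiceMode;
  telegramToken?: string;
}

const VOICE_MODES: readonly VoiceMode[] = ["text", "voice_input", "full_voice"];

// Accepts whatever JSON.parse returned and normalises it.
function parseVoiceSettings(raw: unknown): VoiceSettings {
  const obj = (typeof raw === "object" && raw !== null ? raw : {}) as Record<string, unknown>;
  // Assumption: unknown or missing modes fall back to plain text chat.
  const voiceMode = VOICE_MODES.find((mode) => mode === obj.voiceMode) ?? "text";
  const settings: VoiceSettings = { voiceMode };
  if (typeof obj.telegramToken === "string" && obj.telegramToken.length > 0) {
    settings.telegramToken = obj.telegramToken;
  }
  return settings;
}
```

Falling back instead of throwing keeps an older `nexus-settings.json` without the new keys loadable, which matches the "no migration needed" goal.
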
**Key patterns to follow:**
- Move `/transcribe` out of `chat-files.ts` into `voice.ts` to reduce upstream rebase conflict surface
- Use `execFile` (not `exec`) for CLI subprocess calls — prevents shell injection, matches existing codebase pattern
- Store Telegram token in `nexus-settings.json`, not in DB — DB migrations conflict on rebase
- Long polling (`bot.start()`) not webhooks — Mac Mini is behind NAT with no public HTTPS endpoint
- Wrap all CLI calls (`piper`, `ffmpeg`) in `Promise.race([call, timeout(8000)])` for graceful degradation
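
The `Promise.race` pattern in the last bullet could look like the sketch below. `withTimeout` and its `label` parameter are hypothetical names; the 8000 ms figure from the bullet would be passed in by the caller.

```typescript
// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared so it
  // cannot keep the process alive after the work finishes.
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}
```

A call would read `await withTimeout(runPiper(text), 8000, "piper")`, with `runPiper` standing in for whatever wraps the CLI. The race only stops waiting; it does not stop the child process. Node's `execFile` accepts a `timeout` option that kills the child, which is worth setting alongside this wrapper.
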
### Critical Pitfalls
1. **Audio format mismatch at every entry point** (Pitfall 27, 28) — Browser produces WebM/Opus. Telegram produces OGG/Opus 48kHz. Whisper requires WAV 16kHz mono. Always transcode via ffmpeg at every audio entry point with explicit `-ar 16000 -ac 1`. Make ffmpeg a hard startup dependency with absolute binary path, not PATH-resolved.
2. **Voice mode flag stripped in message pipeline** (Pitfall 32) — The `voiceMode: true` flag on messages must survive every pipeline layer (client → Express → message persistence → agent session codec → Hermes adapter system prompt). If stripped at any layer, the agent responds in full markdown and TTS synthesizes spoken symbols. Audit every layer before building dual output on top of it.
3. **Piper process-per-request anti-pattern** (Pitfall 29) — Spawning a new `piper` process per TTS request reloads the ONNX model each time (200–800ms overhead). Long responses (>400 chars) silently truncate. Sentence-chunk text before synthesis. Implement warmup call at server startup. Use absolute binary paths for service-mode deployment.
4. **Browser autoplay policy blocking TTS playback** (Pitfall 40) — `audio.play()` is blocked unless triggered by a user gesture. The "start voice mode" button click must unlock an `AudioContext` (`ctx.resume()`); subsequent programmatic playback via `AudioBufferSourceNode` works without further gestures. Developers with autoplay whitelisted in dev browsers never see this failure.
5. **Telegram bot event loop blocking on voice pipeline** (Pitfall 37) — File download + ffmpeg transcode + Whisper transcription takes 2–5 seconds. If the handler awaits all of this synchronously, Telegram resends the update and the bot processes the same voice message multiple times. Acknowledge the update immediately, process async, send intermediate "Transcribing..." status to user.
6. **Piper/ffmpeg not found when running as system service** (Pitfall 38) — `spawn('piper', ...)` resolves via shell PATH in interactive terminals but not in `launchd`/`systemd` service environments. Store absolute binary paths in `nexus-settings` config; use them explicitly in every `spawn()` call.
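
The sentence-chunking mitigation for Pitfall 29 can be sketched as follows. `chunkForSynthesis` is a hypothetical helper; the 400-character default mirrors the truncation threshold quoted above, and the sentence splitter is a simple punctuation heuristic rather than a real tokenizer.

```typescript
// Groups whole sentences into chunks no longer than maxChars.
function chunkForSynthesis(text: string, maxChars = 400): string[] {
  // Split after ., ! or ? when followed by whitespace.
  const sentences = text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length <= maxChars) {
      current = candidate;
      continue;
    }
    if (current) chunks.push(current);
    current = sentence;
    // A single sentence over the limit is hard-split (may cut mid-word).
    while (current.length > maxChars) {
      chunks.push(current.slice(0, maxChars));
      current = current.slice(maxChars);
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each returned chunk would then be synthesized separately and the audio concatenated or played in order.
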
---
## Implications for Roadmap
Based on research, the component dependency graph strongly suggests a 4-phase structure:
### Phase 1: Voice Pipeline Foundation
**Rationale:** `voicePipelineService` is the keystone — every other v1.6 feature calls it. Cannot build web voice UI improvements or the Telegram bridge without it. Schema extension for `voiceMode` also gates downstream work. Moving `/transcribe` to `voice.ts` reduces rebase friction before any other work begins.
**Delivers:** `nexus-settings` schema with `voiceMode` + `telegramToken`; `voicePipelineService` with `transcribe`, `synthesize`, `formatForVoice`; `voice.ts` route with `/api/transcribe` (moved from `chat-files.ts`) and `/api/synthesize`; ffmpeg integration for WebM→WAV and OGG→WAV transcoding; `voiceMode` flag on `createMessageSchema` and `ChatMessage` shared type
**Addresses:** Transport-agnostic pipeline (differentiator unlocking all features), voice mode flag storage (required by all consumers), server-side synthesize endpoint (required by Telegram bridge)
**Avoids:** Pitfall 27 (audio format mismatch), Pitfall 32 (voice flag propagation path established before consumers built), Pitfall 38 (absolute binary paths baked in from the start), Pitfall 29 (sentence-chunked synthesis from the start)
**Research flag:** Standard patterns — `execFile`, WAV format conversion, service abstraction are well-documented. Skip `/gsd:research-phase`.
### Phase 2: Web Chat Voice UI
**Rationale:** UI improvements depend only on Phase 1 pipeline and are independent of Telegram. Establishes the voice UX foundation that users interact with directly. Validates the voice mode flag end-to-end before Telegram consumes the same flag.
**Delivers:** `VoiceMicButton` with `@ricky0123/vad-react` silence detection; `WaveformDisplay` via AnalyserNode; `VoiceModeToggle` three-state control; `useVoiceMode` and `useSilenceDetection` hooks; `ChatMessage` dual output (voice badge + expandable full markdown); `TtsButton` auto-play prop; COOP/COEP headers on Express static middleware
**Addresses:** Silence auto-submit (table stakes), waveform visualization (table stakes), auto-play toggle (table stakes), voice mode setting (table stakes), markdown-free voice responses (table stakes)
**Avoids:** Pitfall 31 (VAD library vs. naive RMS threshold), Pitfall 40 (AudioContext unlocked on voice mode start button), Pitfall 35 (sanitizeForTTS utility exists before first TTS integration test)
**Research flag:** `@ricky0123/vad-react` API is confirmed via docs; COOP/COEP header pattern is standard Express middleware. Skip `/gsd:research-phase`.
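
The `AudioContext` unlock for Pitfall 40 is small enough to sketch here. The helper below takes a structural type so it can be exercised outside a browser; `unlockAudio` and `ResumableAudioContext` are hypothetical names, while `state` and `resume()` are the standard Web Audio members.

```typescript
// Subset of the Web Audio AudioContext interface used by the sketch.
interface ResumableAudioContext {
  readonly state: string;
  resume(): Promise<void>;
}

// Must be called from inside a user-gesture handler (for example the
// "start voice mode" click) for the browser to honour the resume.
async function unlockAudio(ctx: ResumableAudioContext): Promise<boolean> {
  if (ctx.state === "suspended") {
    await ctx.resume();
  }
  return ctx.state === "running";
}
```

In the UI this would be wired to the voice mode button, keeping the same `AudioContext` instance for all later TTS playback so no further gesture is required.
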
### Phase 3: Telegram Bridge
**Rationale:** Telegram bridge is a pure consumer of Phase 1's `voicePipelineService` and the existing `chatService`. No web UI changes needed. Must follow Phase 1 but is independent of Phase 2.
**Delivers:** `telegramService` with grammY long polling; text relay to `chatService`; voice note relay (OGG download → ffmpeg transcode → transcribe → agent → text reply); persistent `chatId → sessionId` mapping; agent prefix on replies; `POST /api/telegram/token` and `GET /api/telegram/status` management routes
**Addresses:** Telegram text relay (table stakes), Telegram voice note relay (table stakes), agent identity visible in Telegram replies (table stakes)
**Avoids:** Pitfall 28 (OGG 48kHz → WAV 16kHz explicit transcode, not assumed), Pitfall 33 (persistent session per chatId, not per message), Pitfall 34 (long polling; delete any existing webhook first), Pitfall 37 (async pipeline; acknowledge immediately; send "Transcribing..." status)
**Research flag:** Needs `/gsd:research-phase` for grammY session management (persistent `chatId → sessionId` mapping approach vs. grammY conversation plugin) and async update acknowledgement pattern before implementation.
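
To make the open question concrete, here is a sketch of the lightweight in-memory option for the `chatId → sessionId` mapping, together with the agent prefix format. It is only one of the two candidates still to be evaluated; `ChatSessionRegistry`, `formatAgentReply`, and the injected `createSessionId` callback are hypothetical names.

```typescript
// In-memory chatId -> sessionId mapping. Entries are lost on restart,
// so a truly persistent mapping would need to be written to disk.
class ChatSessionRegistry {
  private readonly sessions = new Map<number, string>();
  private readonly createSessionId: (chatId: number) => string;

  constructor(createSessionId: (chatId: number) => string) {
    this.createSessionId = createSessionId;
  }

  // Returns the existing session for a chat, creating one on first contact.
  getOrCreate(chatId: number): string {
    let sessionId = this.sessions.get(chatId);
    if (sessionId === undefined) {
      sessionId = this.createSessionId(chatId);
      this.sessions.set(chatId, sessionId);
    }
    return sessionId;
  }
}

// Reply format from the feature list: [AgentName]: response
function formatAgentReply(agentName: string, response: string): string {
  return `[${agentName}]: ${response}`;
}
```

Inside a message handler, the bot would look up `registry.getOrCreate(chatId)` before calling `chatService`, so every message from the same Telegram chat lands in one session rather than a new session per message.
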
### Phase 4: Polish and Post-Launch Additions
**Rationale:** After core voice and Telegram are validated, add differentiator features that require voice pipeline stability. These are explicitly post-validation based on user feedback triggers.
**Delivers:** Telegram TTS reply (synthesize OGG voice note reply); sentence-buffered TTS streaming; Piper persistent warmup optimization; voice response history in chat UI
**Addresses:** Sentence-buffered TTS (differentiator), Telegram TTS reply (differentiator)
**Avoids:** Pitfall 39 (dual output via single LLM call, not two calls), Pitfall 29 (persistent Piper process architecture)
**Research flag:** Flag for `/gsd:research-phase` on Piper persistent HTTP wrapper — community `piper-http` package status is unconfirmed; verify before committing to this approach.
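
Sentence-buffered streaming amounts to overlapping synthesis of the next sentence with playback of the current one. The sketch below shows that overlap with injected functions; `speakPipelined`, `synthesize`, and `play` are hypothetical names, and error handling is intentionally omitted.

```typescript
// Plays sentences in order while synthesizing one sentence ahead.
async function speakPipelined(
  sentences: string[],
  synthesize: (text: string) => Promise<Uint8Array>,
  play: (audio: Uint8Array) => Promise<void>,
): Promise<void> {
  const [first, ...rest] = sentences;
  if (first === undefined) return;

  let pending = synthesize(first);
  for (const next of rest) {
    const audio = await pending;
    // Start synthesizing the next sentence before playback begins,
    // so the two run concurrently.
    pending = synthesize(next);
    await play(audio);
  }
  await play(await pending);
}
```

With this shape the listener waits for only the first sentence to synthesize, rather than the whole response. A production version would also need to handle a failed or cancelled playback so the in-flight synthesis does not reject unobserved.
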
### Phase Ordering Rationale
- `voicePipelineService` (Phase 1) strictly precedes both Phase 2 and Phase 3 — this is the hardest dependency in the v1.6 graph
- Phase 2 and Phase 3 are independent of each other and can run in parallel for two-developer teams; sequential ordering here assumes single-developer delivery
- `voiceMode` schema change (Phase 1) must precede `ChatMessage` dual output (Phase 2) — shared package change gates UI work
- Moving `/transcribe` from `chat-files.ts` to `voice.ts` in Phase 1 reduces rebase conflict surface before any other work begins
- Phase 4 is explicitly post-validation — only add Telegram TTS reply and sentence-buffered streaming after confirming the basic pipeline is stable in real use
---
## Confidence Assessment
| Area | Confidence | Notes |
|------|------------|-------|
| Stack | MEDIUM-HIGH | grammy HIGH (official docs, Bot API 9.6 verified); ffmpeg-static MEDIUM (arm64 confirmed, pipe approach verified); vad-react MEDIUM (React 19 fix confirmed via GitHub issue; ONNX WASM SharedArrayBuffer behavior requires COOP/COEP header testing) |
| Features | MEDIUM-HIGH | STT/TTS pipeline patterns well-documented; dual output prompt engineering reliability is MEDIUM — smaller 7B models produce malformed structured output ~10% of the time; Approach B fallback (post-processing strip) must be implemented |
| Architecture | HIGH | Based on direct codebase inspection of actual source files; service boundary and data flow verified; no speculative assumptions |
| Pitfalls | HIGH | Based on direct codebase analysis plus targeted research on each integration domain; v1.6 pitfalls 27–40 are specific, sourced, and actionable |
**Overall confidence:** MEDIUM-HIGH
### Gaps to Address
- **grammY session management approach:** Lightweight in-memory `Map<chatId, sessionId>` vs. grammY conversation plugin — not evaluated. Validate during Phase 3 research-phase before implementation.
- **Dual output prompt reliability on 7B models:** Works reliably on larger models; ~90% on 7B tier. Approach B fallback (post-processing strip) must be implemented as a safety net, not treated as optional. Design both before Phase 1 ships.
- **Piper persistent process viability:** Sentence-chunked per-request synthesis avoids the worst of the reload latency, but a persistent Piper HTTP wrapper would be cleaner long-term. Community `piper-http` status unconfirmed. Flag for Phase 4 research-phase.
- **smart-whisper OGG support:** Whether `smart-whisper` can ingest OGG directly (avoiding ffmpeg for the Telegram path) or always requires WAV was not confirmed. Verify at Phase 1 start — if OGG is accepted natively, the Telegram transcription path can skip one transcode step.
---
## Sources
### Primary (HIGH confidence)
- [grammY official docs](https://grammy.dev/) — TypeScript support, long polling, file handling, Bot API 9.6 support
- [grammY deployment types guide](https://grammy.dev/guide/deployment-types) — long polling vs. webhooks recommendation for local deployment
- [ffmpeg-static GitHub](https://github.com/eugeneware/ffmpeg-static) — macOS arm64 binary confirmed, FFmpeg 6.1.1, pipe-based invocation pattern
- [Telegram Bot API sendVoice](https://core.telegram.org/bots/api#sendvoice) — OGG Opus format requirement, 48kHz mono wire format
- Direct codebase inspection: `server/src/routes/chat-files.ts`, `chat.ts`, `services/nexus-settings.ts`, `app.ts`, `ui/src/components/VoiceRecordButton.tsx`, `TtsButton.tsx`, `hooks/usePiperTts.ts`, `packages/shared/src/validators/chat.ts`, `packages/shared/src/types/chat.ts`
- `.planning/STATE.md` — v1.6 architectural decisions (transport-agnostic, disposable bridge, dual output, per-message flag)
### Secondary (MEDIUM confidence)
- [@ricky0123/vad-react npm](https://www.npmjs.com/package/@ricky0123/vad-react) — v0.0.36, React 19 fix confirmed
- [vad React 19 support issue #188](https://github.com/ricky0123/vad/issues/188) — React 19 peer dep fix confirmed August 2025
- [vad API docs](https://docs.vad.ricky0123.com/user-guide/api/) — `onSpeechEnd` Float32Array 16kHz output confirmed
- [fluent-ffmpeg archival](https://github.com/fluent-ffmpeg/node-fluent-ffmpeg) — archived May 22 2025, confirmed
- [Real-Time vs Turn-Based STT/TTS Voice Agent Architecture (softcery.com)](https://softcery.com/lab/ai-voice-agents-real-time-vs-turn-based-tts-stt-architecture)
- [The Voice AI Stack for Building Agents (assemblyai.com)](https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents)
- [Telegram speech-to-text bot with Node.js (loonskai.com)](https://www.loonskai.com/blog/telegram-speech-to-text-bot-with-nodejs)
- [grammY file handling guide](https://grammy.dev/guide/files) — `ctx.getFile()`, download pattern
### Tertiary (LOW confidence — inferred from patterns)
- Dual output prompt reliability on 7B models — inferred from structured output community reports; not benchmarked on Hermes specifically
- Piper persistent HTTP wrapper — community pattern referenced; `piper-http` package status not verified
- `sanitizeForTTS` utility pattern — inferred from TTS pipeline implementations; implementation detail not sourced from a canonical reference
---
*Research completed: 2026-04-03*
*Ready for roadmap: yes*