Project Research Summary

Project: Nexus v1.6 — Voice Pipeline + Telegram Bridge
Domain: Server-side STT/TTS voice pipeline with transport-agnostic service abstraction and a minimal Telegram relay bridge
Researched: 2026-04-03
Confidence: MEDIUM-HIGH

Executive Summary

Nexus v1.6 adds two parallel capability tracks onto an existing React/Express/Paperclip monorepo: a transport-agnostic voice pipeline (Whisper STT + Piper TTS) and a minimal Telegram bridge that reuses those pipeline primitives for phone access. The established expert pattern for this class of system is a shared service abstraction (voicePipelineService) that both the web HTTP layer and the Telegram bot call directly — never duplicating STT/TTS logic across transports. The Telegram bridge must be a thin relay only, forwarding messages to the existing chatService and returning the response, with no separate bot personality, no rich UI elements, and no per-user conversation branching beyond the existing single-workspace model.

The recommended approach is to build voicePipelineService first as the keystone service (transcribe, synthesize, formatForVoice), then wire the web voice UI improvements on top of it, then attach the Telegram bridge as a consumer of the same service. Audio format conversion via ffmpeg-static (not the archived fluent-ffmpeg) handles the two required transcoding paths: browser WebM/Opus to WAV 16kHz for Whisper, and Telegram OGG/Opus to WAV 16kHz for Whisper. The @ricky0123/vad-react library handles browser-side voice activity detection. grammy ^1.41.1 handles the Telegram bot layer with long polling (correct for a local Mac Mini deployment without a public HTTPS endpoint).
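Both transcoding paths above reduce to the same ffmpeg invocation. The following is a minimal sketch of the argument list, assuming file-based input and output. `buildWhisperTranscodeArgs` is a hypothetical helper name, and `-c:a pcm_s16le` (16-bit PCM) is an assumption about the WAV encoding rather than something stated in the research; only `-ar 16000 -ac 1` comes from the findings.

```typescript
// Hypothetical helper: builds ffmpeg arguments that convert any probeable
// input (browser WebM/Opus, Telegram OGG/Opus) to 16 kHz mono WAV.
export function buildWhisperTranscodeArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-hide_banner",
    "-loglevel", "error",
    "-y",                // overwrite the output file if it already exists
    "-i", inputPath,     // container and codec are probed by ffmpeg
    "-vn",               // ignore any video stream in the container
    "-ar", "16000",      // resample to 16 kHz
    "-ac", "1",          // downmix to mono
    "-c:a", "pcm_s16le", // assumption: 16-bit little-endian PCM
    "-f", "wav",
    outputPath,
  ];
}
```

The resulting array would be passed to `child_process.spawn` together with the binary path that ffmpeg-static exports, so no shell is involved and the same helper serves both the web and Telegram entry points.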

The key risks are: (1) audio format mismatches causing silent transcription failures across browsers and the Telegram path, which require ffmpeg transcoding at every entry point; (2) the voice mode flag being stripped as it traverses the message pipeline layers, causing agents to respond with full markdown that TTS then renders as "asterisk asterisk important asterisk asterisk"; (3) Piper being invoked as a new process per request, causing 200–800ms model reload latency on every TTS response and silent truncation on responses over ~400 characters; and (4) browser autoplay policy blocking audio playback unless the AudioContext is unlocked during the user's initial "start voice mode" gesture.

Key Findings

Recommended Stack

v1.6 is additive to the v1.5 stack. The existing smart-whisper, @mintplex-labs/piper-tts-web, multer, and Express foundations remain unchanged. Three new libraries are required.

Core technologies:

@ricky0123/vad-react ^0.0.36 (ui/) — Browser-side Silero VAD via ONNX Runtime Web; delivers Float32Array at 16kHz on speech end; React 19 peer dep confirmed fixed August 2025; requires COOP/COEP headers for SharedArrayBuffer
ffmpeg-static ^5.2.0 (server/) — Ships FFmpeg 6.1.1 binaries including macOS arm64; invoked via child_process.spawn; do NOT use the archived fluent-ffmpeg (archived May 2025) or stale @ffmpeg-installer/ffmpeg (FFmpeg 4.x)
grammy ^1.41.1 (server/) — TypeScript-native Telegram bot framework (1.4M weekly downloads, higher than Telegraf); long polling for local deployment; clean file handling API via ctx.getFile(); Bot API 9.6 support confirmed

No new library is required for server-side Piper TTS (existing child_process.spawn pattern from v1.5) or audio playback (native <audio> element + Web Audio API).

Critical compatibility note: @ricky0123/vad-react requires COOP/COEP HTTP headers on HTML responses for SharedArrayBuffer support. Without them, VAD silently fails in Chrome and Firefox. The fix is a small addition to the Express static file middleware.
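As an illustration of the header requirement, here is a small sketch of a middleware that sets both headers. It is typed structurally so it does not depend on Express types; `crossOriginIsolation` and `HeaderWritable` are hypothetical names, and where exactly it is mounted in app.ts is an assumption.

```typescript
// Minimal structural type: anything with a Node-style setHeader works,
// including an Express response object.
interface HeaderWritable {
  setHeader(name: string, value: string): unknown;
}

// Sets the two headers that make a page cross-origin isolated, which is
// what browsers require before exposing SharedArrayBuffer.
export function crossOriginIsolation(
  _req: unknown,
  res: HeaderWritable,
  next: () => void,
): void {
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  next();
}
```

It would be registered with `app.use(crossOriginIsolation)` ahead of the static file handler. One side effect worth checking: with `require-corp`, cross-origin subresources that do not opt in via CORS or a Cross-Origin-Resource-Policy header are blocked.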

Expected Features

Must have (table stakes — v1.6 launch):

Silence-based auto-submit via @ricky0123/vad-react — users expect this; manual stop feels archaic
Waveform/amplitude visualization while recording — without it users cannot confirm mic is active
Voice response auto-play with toggle — users expect playback to be automatic unless disabled
Markdown-free voice responses — spoken markdown sounds broken; dual output (prose + full markdown) is the correct solution
Telegram text relay with agent prefix — core use case for phone access; format: [AgentName]: response
Telegram voice note transcription — mobile Telegram users default to voice notes; ignoring voice notes frustrates those users immediately
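For the markdown-free voice response item above, a post-processing strip can be sketched as follows. This is a rough illustration only: `stripMarkdownForSpeech` is a hypothetical name, the rule set is an assumption, and it deliberately handles asterisk emphasis but not underscore emphasis so that identifiers like `voice_input` are left alone.

```typescript
// Removes common markdown syntax so a TTS engine does not read symbols aloud.
export function stripMarkdownForSpeech(markdown: string): string {
  return markdown
    .replace(/```[\s\S]*?```/g, " ")                 // drop fenced code blocks
    .replace(/`([^`]*)`/g, "$1")                     // inline code -> plain text
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, "$1")        // images -> alt text
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1")         // links -> link label
    .replace(/^[ \t]{0,3}#{1,6}[ \t]+/gm, "")        // heading markers
    .replace(/^[ \t]{0,3}>[ \t]?/gm, "")             // blockquote markers
    .replace(/^[ \t]*(?:[-*+]|\d+\.)[ \t]+/gm, "")   // list markers
    .replace(/\*\*(\S(?:.*?\S)?)\*\*/g, "$1")        // **bold**
    .replace(/\*(\S(?:.*?\S)?)\*/g, "$1")            // *italic*
    .replace(/\s+/g, " ")                            // collapse whitespace
    .trim();
}
```

A regex pass like this is lossy (tables and nested emphasis are not handled), so it suits a safety-net role behind the dual-output prompt rather than a replacement for it.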

Should have (differentiators, add after validation):

Telegram TTS reply option (OGG voice note reply back) — add after text relay is validated
Sentence-buffered TTS streaming — start playing sentence 1 while sentence 2 synthesizes; reduces perceived latency

Defer (v2+):

Real-time speech-to-speech — requires full-duplex WebSocket audio + Pipecat/LiveKit; entirely different architecture
Wake word detection — always-on mic; hardware device concern
Deep Telegram web chat session sync — requires Postgres pub/sub event bus; explicitly deferred per PROJECT.md
Per-agent Telegram bots — maintenance nightmare; single bot + agent prefix is the correct approach

Architecture Approach

The architecture is built around a single server-side voicePipelineService that both HTTP voice routes and the Telegram relay call directly, with no HTTP round-trip within the same process. The existing chatService and puterProxyService are consumed directly by the Telegram bridge as TypeScript function calls. nexus-settings.json (not DB) stores voiceMode enum and telegramToken. No DB schema changes are required.
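The shared-service boundary described above might look like the sketch below. The three method names come from this summary; the parameter and return types, the `AudioFormat` union, and the `SendToAgent` stand-in for the existing chatService call are all assumptions made for illustration.

```typescript
export type AudioFormat = "webm" | "ogg" | "wav";

// Method names per the summary; types are assumed.
export interface VoicePipelineService {
  transcribe(buffer: Uint8Array, format: AudioFormat): Promise<string>;
  synthesize(text: string, voiceId?: string): Promise<Uint8Array>;
  formatForVoice(text: string): string;
}

// Hypothetical stand-in for the existing chatService entry point.
export type SendToAgent = (sessionId: string, text: string) => Promise<string>;

// Web transport: an HTTP route handler would pass the uploaded body here.
export async function transcribeForWeb(
  voice: VoicePipelineService,
  webm: Uint8Array,
): Promise<string> {
  return voice.transcribe(webm, "webm");
}

// Telegram transport: same service instance, plain function calls,
// no HTTP round-trip inside the process.
export async function relayTelegramVoice(
  voice: VoicePipelineService,
  sendToAgent: SendToAgent,
  sessionId: string,
  ogg: Uint8Array,
): Promise<string> {
  const transcript = await voice.transcribe(ogg, "ogg");
  return sendToAgent(sessionId, transcript);
}
```

Because both consumers depend only on the interface, the STT/TTS logic lives in one place and either transport can be tested with a fake service.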

Major components:

voicePipelineService (server/src/services/voice-pipeline.ts) — Transport-agnostic STT/TTS core; transcribe(buffer, format), synthesize(text, voiceId?), formatForVoice(text) — the keystone abstraction for v1.6
telegram service (server/src/services/telegram.ts) — grammY bot lifecycle + thin relay; calls voicePipelineService and chatService directly; long polling; one persistent sessionId per Telegram chatId
voice.ts route (server/src/routes/voice.ts) — HTTP wrappers for POST /api/transcribe (moved from chat-files.ts) and new POST /api/synthesize; keeps chat-files.ts close to upstream for clean rebases
UI voice components (VoiceMicButton, WaveformDisplay, VoiceModeToggle, useVoiceMode, useSilenceDetection) — all new; enhance existing ChatInput without replacing VoiceRecordButton
nexus-settings schema extension — adds voiceMode: "text" | "voice_input" | "full_voice" and optional telegramToken; no DB migration needed
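The settings extension in the last item can be sketched as a type plus a tolerant parser. The field names and the three enum values are taken from this summary; `parseVoiceSettings` is a hypothetical function, and falling back to `"text"` for missing or unknown values is an assumption.

```typescript
export const VOICE_MODES = ["text", "voice_input", "full_voice"] as const;
export type VoiceMode = (typeof VOICE_MODES)[number];

export interface NexusVoiceSettings {
  voiceMode: VoiceMode;
  telegramToken?: string;
}

function isVoiceMode(value: string): value is VoiceMode {
  return (VOICE_MODES as readonly string[]).includes(value);
}

// Reads the voice-related keys from already-parsed nexus-settings.json
// content. Unknown or missing voiceMode falls back to "text" (assumption).
export function parseVoiceSettings(raw: unknown): NexusVoiceSettings {
  const obj: Record<string, unknown> =
    typeof raw === "object" && raw !== null ? (raw as Record<string, unknown>) : {};
  const mode = obj["voiceMode"];
  const voiceMode: VoiceMode = typeof mode === "string" && isVoiceMode(mode) ? mode : "text";
  const token = obj["telegramToken"];
  return typeof token === "string" && token.length > 0
    ? { voiceMode, telegramToken: token }
    : { voiceMode };
}
```

Keeping the parser tolerant means an older settings file without these keys still loads, which fits the "no DB migration" constraint.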

Key patterns to follow:

Move /transcribe out of chat-files.ts into voice.ts to reduce upstream rebase conflict surface
Use execFile (not exec) for CLI subprocess calls — prevents shell injection, matches existing codebase pattern
Store Telegram token in nexus-settings.json, not in DB — DB migrations conflict on rebase
Long polling (bot.start()) not webhooks — Mac Mini is behind NAT with no public HTTPS endpoint
Wrap all CLI calls (piper, ffmpeg) in Promise.race([call, timeout(8000)]) for graceful degradation
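The `Promise.race` pattern in the last item can be sketched like this. `withTimeout` and `TimeoutError` are hypothetical names; the 8000 ms figure from the list would be passed in by the caller.

```typescript
export class TimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TimeoutError";
  }
}

// Rejects with TimeoutError if `work` has not settled within `ms`.
// The timer is cleared on either outcome so it never keeps the process alive.
export function withTimeout<T>(work: Promise<T>, ms: number, label = "operation"): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(`${label} timed out after ${ms} ms`)), ms);
  });
  const clear = (): void => {
    if (timer !== undefined) clearTimeout(timer);
  };
  return Promise.race([work, timeout]).then(
    (value) => {
      clear();
      return value;
    },
    (error: unknown) => {
      clear();
      throw error;
    },
  );
}
```

Note that racing only stops the caller from waiting; it does not stop the subprocess. For CLI calls, the `timeout` option of `child_process.execFile` (or an explicit `kill()` on the child) would still be needed to reclaim a hung piper or ffmpeg process.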

Critical Pitfalls

Audio format mismatch at every entry point (Pitfall 27, 28) — Browser produces WebM/Opus. Telegram produces OGG/Opus 48kHz. Whisper requires WAV 16kHz mono. Always transcode via ffmpeg at every audio entry point with explicit -ar 16000 -ac 1. Make ffmpeg a hard startup dependency with absolute binary path, not PATH-resolved.
Voice mode flag stripped in message pipeline (Pitfall 32) — The voiceMode: true flag on messages must survive every pipeline layer (client → Express → message persistence → agent session codec → Hermes adapter system prompt). If stripped at any layer, the agent responds in full markdown and TTS synthesizes spoken symbols. Audit every layer before building dual output on top of it.
Piper process-per-request anti-pattern (Pitfall 29) — Spawning a new piper process per TTS request reloads the ONNX model each time (200–800ms overhead). Long responses (>400 chars) silently truncate. Sentence-chunk text before synthesis. Implement warmup call at server startup. Use absolute binary paths for service-mode deployment.
Browser autoplay policy blocking TTS playback (Pitfall 40) — audio.play() is blocked unless triggered by a user gesture. The "start voice mode" button click must unlock an AudioContext (ctx.resume()); subsequent programmatic playback via AudioBufferSourceNode works without further gestures. Developers with autoplay whitelisted in dev browsers never see this failure.
Telegram bot event loop blocking on voice pipeline (Pitfall 37) — File download + ffmpeg transcode + Whisper transcription takes 2–5 seconds. If the handler awaits all of this synchronously, Telegram resends the update and the bot processes the same voice message multiple times. Acknowledge the update immediately, process async, send intermediate "Transcribing..." status to user.
Piper/ffmpeg not found when running as system service (Pitfall 38) — spawn('piper', ...) finds the binary through the PATH inherited from an interactive terminal, but launchd/systemd service environments typically start with a minimal PATH that does not include it. Store absolute binary paths in nexus-settings config; use them explicitly in every spawn() call.
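The sentence-chunking mitigation for the Piper truncation pitfall might be sketched as below. The 400-character default mirrors the threshold quoted above; `chunkForSynthesis` is a hypothetical name, and the boundary rule (terminal punctuation followed by whitespace) is a simplifying assumption that will not handle every abbreviation correctly.

```typescript
// Split on ., ! or ? only when followed by whitespace or end of text,
// so version strings such as "v1.6" stay intact.
function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    const next = text.charAt(i + 1); // "" past the end of the string
    if ((ch === "." || ch === "!" || ch === "?") && (next === "" || /\s/.test(next))) {
      const sentence = text.slice(start, i + 1).trim();
      if (sentence) sentences.push(sentence);
      start = i + 1;
    }
  }
  const tail = text.slice(start).trim();
  if (tail) sentences.push(tail);
  return sentences;
}

// Greedily join units with single spaces without exceeding maxChars.
// A single unit longer than maxChars is emitted as its own chunk.
function pack(units: readonly string[], maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const unit of units) {
    if (current && current.length + 1 + unit.length > maxChars) {
      chunks.push(current);
      current = "";
    }
    current = current ? `${current} ${unit}` : unit;
  }
  if (current) chunks.push(current);
  return chunks;
}

export function chunkForSynthesis(text: string, maxChars = 400): string[] {
  const units: string[] = [];
  for (const sentence of splitSentences(text)) {
    if (sentence.length <= maxChars) {
      units.push(sentence);
    } else {
      // Over-long sentence: fall back to word boundaries.
      units.push(...pack(sentence.split(/\s+/), maxChars));
    }
  }
  return pack(units, maxChars);
}
```

Each returned chunk would then be synthesized in its own Piper call, which keeps every request under the length at which truncation was observed.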

Implications for Roadmap

Based on research, the component dependency graph strongly suggests a 4-phase structure:

Phase 1: Voice Pipeline Foundation

Rationale: voicePipelineService is the keystone; every other v1.6 feature calls it. Neither the web voice UI improvements nor the Telegram bridge can be built without it. Schema extension for voiceMode also gates downstream work. Moving /transcribe to voice.ts reduces rebase friction before any other work begins.
Delivers: nexus-settings schema with voiceMode + telegramToken; voicePipelineService with transcribe, synthesize, formatForVoice; voice.ts route with /api/transcribe (moved from chat-files.ts) and /api/synthesize; ffmpeg integration for WebM→WAV and OGG→WAV transcoding; voiceMode flag on createMessageSchema and ChatMessage shared type
Addresses: Transport-agnostic pipeline (differentiator unlocking all features), voice mode flag storage (required by all consumers), server-side synthesize endpoint (required by Telegram bridge)
Avoids: Pitfall 27 (audio format mismatch), Pitfall 32 (voice flag propagation path established before consumers built), Pitfall 38 (absolute binary paths baked in from the start), Pitfall 29 (sentence-chunked synthesis from the start)
Research flag: Standard patterns. execFile, WAV format conversion, and service abstraction are well-documented. Skip /gsd:research-phase.

Phase 2: Web Chat Voice UI

Rationale: UI improvements depend only on Phase 1 pipeline and are independent of Telegram. Establishes the voice UX foundation that users interact with directly. Validates the voice mode flag end-to-end before Telegram consumes the same flag.
Delivers: VoiceMicButton with @ricky0123/vad-react silence detection; WaveformDisplay via AnalyserNode; VoiceModeToggle three-state control; useVoiceMode and useSilenceDetection hooks; ChatMessage dual output (voice badge + expandable full markdown); TtsButton auto-play prop; COOP/COEP headers on Express static middleware
Addresses: Silence auto-submit (table stakes), waveform visualization (table stakes), auto-play toggle (table stakes), voice mode setting (table stakes), markdown-free voice responses (table stakes)
Avoids: Pitfall 31 (VAD library vs. naive RMS threshold), Pitfall 40 (AudioContext unlocked on voice mode start button), Pitfall 35 (sanitizeForTTS utility exists before first TTS integration test)
Research flag: @ricky0123/vad-react API is confirmed via docs; COOP/COEP header pattern is standard Express middleware. Skip /gsd:research-phase.
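The AudioContext unlock from Pitfall 40 is small enough to sketch here. The function is written against a minimal structural interface so it can be exercised outside a browser; `unlockAudio` and `ResumableAudioContext` are hypothetical names, and a real `AudioContext` satisfies the interface.

```typescript
// The subset of AudioContext this sketch relies on.
interface ResumableAudioContext {
  state: string; // "suspended" | "running" | "closed" in browsers
  resume(): Promise<void>;
}

// Must be called from inside a user-gesture handler (for example the
// "start voice mode" click). Returns true once the context is running.
export async function unlockAudio(ctx: ResumableAudioContext): Promise<boolean> {
  if (ctx.state === "suspended") {
    await ctx.resume();
  }
  return ctx.state === "running";
}
```

In the UI this would be called from the VoiceModeToggle click handler with a long-lived `AudioContext`, and later TTS playback would reuse that same context rather than creating a new one.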

Phase 3: Telegram Bridge

Rationale: Telegram bridge is a pure consumer of Phase 1's voicePipelineService and the existing chatService. No web UI changes needed. Must follow Phase 1 but is independent of Phase 2.
Delivers: telegramService with grammY long polling; text relay to chatService; voice note relay (OGG download → ffmpeg transcode → transcribe → agent → text reply); persistent chatId → sessionId mapping; agent prefix on replies; POST /api/telegram/token and GET /api/telegram/status management routes
Addresses: Telegram text relay (table stakes), Telegram voice note relay (table stakes), agent identity visible in Telegram replies (table stakes)
Avoids: Pitfall 28 (OGG 48kHz → WAV 16kHz explicit transcode, not assumed), Pitfall 33 (persistent session per chatId, not per message), Pitfall 34 (long polling; delete any existing webhook first), Pitfall 37 (async pipeline; acknowledge immediately; send "Transcribing..." status)
Research flag: Needs /gsd:research-phase for grammY session management (persistent chatId → sessionId mapping approach vs. grammY conversation plugin) and async update acknowledgement pattern before implementation.
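To make the session mapping and reply prefix concrete, here is a sketch of the lightweight in-memory option. `TelegramSessionMap` and `formatAgentReply` are hypothetical names, the session ID factory is injected because the real chatService session API is not described here, and an in-memory map does not survive a restart, so it only approximates the "persistent" requirement.

```typescript
// One sessionId per Telegram chat, created lazily on first message.
export class TelegramSessionMap {
  private readonly sessions = new Map<number, string>();
  private readonly createSessionId: (chatId: number) => string;

  constructor(createSessionId: (chatId: number) => string) {
    this.createSessionId = createSessionId;
  }

  sessionFor(chatId: number): string {
    let id = this.sessions.get(chatId);
    if (id === undefined) {
      id = this.createSessionId(chatId);
      this.sessions.set(chatId, id);
    }
    return id;
  }
}

// Reply format from the feature list: [AgentName]: response
export function formatAgentReply(agentName: string, response: string): string {
  return `[${agentName}]: ${response}`;
}
```

A bot handler would look up `sessionFor(ctx.chat.id)`, forward the text to chatService under that session, and send back `formatAgentReply(...)`. Whether this or the grammY conversation plugin is the better fit is the open question flagged for research.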

Phase 4: Polish and Post-Launch Additions

Rationale: After core voice and Telegram are validated, add differentiator features that require voice pipeline stability. These are explicitly post-validation based on user feedback triggers.
Delivers: Telegram TTS reply (synthesize OGG voice note reply); sentence-buffered TTS streaming; Piper persistent warmup optimization; voice response history in chat UI
Addresses: Sentence-buffered TTS (differentiator), Telegram TTS reply (differentiator)
Avoids: Pitfall 39 (dual output via single LLM call, not two calls), Pitfall 29 (persistent Piper process architecture)
Research flag: Flag for /gsd:research-phase on Piper persistent HTTP wrapper. Community piper-http package status is unconfirmed; verify before committing to this approach.
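Sentence-buffered streaming amounts to a one-step lookahead: synthesis of the next sentence is started before playback of the current one is awaited. A minimal sketch, with `speakSentenceBuffered` as a hypothetical name and both `synthesize` and `play` injected as assumed callbacks; error handling is intentionally left out.

```typescript
// Plays sentences in order while synthesizing one sentence ahead.
export async function speakSentenceBuffered<A>(
  sentences: readonly string[],
  synthesize: (sentence: string) => Promise<A>,
  play: (audio: A) => Promise<void>,
): Promise<void> {
  const queue = sentences.slice();
  const first = queue.shift();
  if (first === undefined) return;

  let pending: Promise<A> = synthesize(first);
  for (;;) {
    const audio = await pending;
    // Kick off the next synthesis before waiting on playback.
    const next = queue.shift();
    const upcoming = next === undefined ? undefined : synthesize(next);
    await play(audio);
    if (upcoming === undefined) return;
    pending = upcoming;
  }
}
```

With this shape, perceived latency is roughly the synthesis time of the first sentence only, provided each later sentence synthesizes faster than the previous one plays.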

Phase Ordering Rationale

voicePipelineService (Phase 1) strictly precedes both Phase 2 and Phase 3 — this is the hardest dependency in the v1.6 graph
Phase 2 and Phase 3 are independent of each other and can run in parallel for two-developer teams; sequential ordering here assumes single-developer delivery
voiceMode schema change (Phase 1) must precede ChatMessage dual output (Phase 2) — shared package change gates UI work
Moving /transcribe from chat-files.ts to voice.ts in Phase 1 reduces rebase conflict surface before any other work begins
Phase 4 is explicitly post-validation — only add Telegram TTS reply and sentence-buffered streaming after confirming the basic pipeline is stable in real use

Confidence Assessment

Area	Confidence	Notes
Stack	MEDIUM-HIGH	grammy HIGH (official docs, Bot API 9.6 verified); ffmpeg-static MEDIUM (arm64 confirmed, pipe approach verified); vad-react MEDIUM (React 19 fix confirmed via GitHub issue; ONNX WASM SharedArrayBuffer behavior requires COOP/COEP header testing)
Features	MEDIUM-HIGH	STT/TTS pipeline patterns well-documented; dual output prompt engineering reliability is MEDIUM — smaller 7B models produce malformed structured output ~10% of the time; Approach B fallback (post-processing strip) must be implemented
Architecture	HIGH	Based on direct codebase inspection of actual source files; service boundary and data flow verified; no speculative assumptions
Pitfalls	HIGH	Based on direct codebase analysis plus targeted research on each integration domain; v1.6 pitfalls 27–40 are specific, sourced, and actionable

Overall confidence: MEDIUM-HIGH

Gaps to Address

grammY session management approach: Lightweight in-memory Map<chatId, sessionId> vs. grammY conversation plugin — not evaluated. Validate during Phase 3 research-phase before implementation.
Dual output prompt reliability on 7B models: Works reliably on larger models; ~90% on 7B tier. Approach B fallback (post-processing strip) must be implemented as a safety net, not treated as optional. Design both before Phase 1 ships.
Piper persistent process viability: Sentence-chunked per-request synthesis avoids the worst of the reload latency, but a persistent Piper HTTP wrapper would be cleaner long-term. Community piper-http status unconfirmed. Flag for Phase 4 research-phase.
smart-whisper OGG support: Whether smart-whisper can ingest OGG directly (avoiding ffmpeg for the Telegram path) or always requires WAV was not confirmed. Verify at Phase 1 start — if OGG is accepted natively, the Telegram transcription path can skip one transcode step.

Sources

Primary (HIGH confidence)

grammY official docs — TypeScript support, long polling, file handling, Bot API 9.6 support
grammY deployment types guide — long polling vs. webhooks recommendation for local deployment
ffmpeg-static GitHub — macOS arm64 binary confirmed, FFmpeg 6.1.1, pipe-based invocation pattern
Telegram Bot API sendVoice — OGG Opus format requirement, 48kHz mono wire format
Direct codebase inspection: server/src/routes/chat-files.ts, chat.ts, services/nexus-settings.ts, app.ts, ui/src/components/VoiceRecordButton.tsx, TtsButton.tsx, hooks/usePiperTts.ts, packages/shared/src/validators/chat.ts, packages/shared/src/types/chat.ts
.planning/STATE.md — v1.6 architectural decisions (transport-agnostic, disposable bridge, dual output, per-message flag)

Secondary (MEDIUM confidence)

@ricky0123/vad-react npm — v0.0.36, React 19 fix confirmed
vad React 19 support issue #188 — React 19 peer dep fix confirmed August 2025
vad API docs — onSpeechEnd Float32Array 16kHz output confirmed
fluent-ffmpeg archival — archived May 22 2025, confirmed
Real-Time vs Turn-Based STT/TTS Voice Agent Architecture (softcery.com)
The Voice AI Stack for Building Agents (assemblyai.com)
Telegram speech-to-text bot with Node.js (loonskai.com)
grammY file handling guide — ctx.getFile(), download pattern

Tertiary (LOW confidence — inferred from patterns)

Dual output prompt reliability on 7B models — inferred from structured output community reports; not benchmarked on Hermes specifically
Piper persistent HTTP wrapper — community pattern referenced; piper-http package status not verified
sanitizeForTTS utility pattern — inferred from TTS pipeline implementations; implementation detail not sourced from a canonical reference

Research completed: 2026-04-03
Ready for roadmap: yes
