Project Research Summary
Project: Nexus v1.6: Voice Pipeline + Telegram Bridge
Domain: Server-side STT/TTS voice pipeline with transport-agnostic service abstraction and a minimal Telegram relay bridge
Researched: 2026-04-03
Confidence: MEDIUM-HIGH
Executive Summary
Nexus v1.6 adds two parallel capability tracks onto an existing React/Express/Paperclip monorepo: a transport-agnostic voice pipeline (Whisper STT + Piper TTS) and a minimal Telegram bridge that reuses those pipeline primitives for phone access. The established expert pattern for this class of system is a shared service abstraction (voicePipelineService) that both the web HTTP layer and the Telegram bot call directly — never duplicating STT/TTS logic across transports. The Telegram bridge must be a thin relay only, forwarding messages to the existing chatService and returning the response, with no separate bot personality, no rich UI elements, and no per-user conversation branching beyond the existing single-workspace model.
The recommended approach is to build voicePipelineService first as the keystone service (transcribe, synthesize, formatForVoice), then wire the web voice UI improvements on top of it, then attach the Telegram bridge as a consumer of the same service. Audio format conversion via ffmpeg-static (not the archived fluent-ffmpeg) handles the two required transcoding paths: browser WebM/Opus to WAV 16kHz for Whisper, and Telegram OGG/Opus to WAV 16kHz for Whisper. The @ricky0123/vad-react library handles browser-side voice activity detection. grammy ^1.41.1 handles the Telegram bot layer with long polling (correct for a local Mac Mini deployment without a public HTTPS endpoint).
The key risks are: (1) audio format mismatches causing silent transcription failures across browsers and the Telegram path, which require ffmpeg transcoding at every entry point; (2) the voice mode flag being stripped as it traverses the message pipeline layers, causing agents to respond with full markdown that TTS then renders as "asterisk asterisk important asterisk asterisk"; (3) Piper being invoked as a new process per request, causing 200–800ms model reload latency on every TTS response and silent truncation on responses over ~400 characters; and (4) browser autoplay policy blocking audio playback unless the AudioContext is unlocked during the user's initial "start voice mode" gesture.
Key Findings
Recommended Stack
v1.6 is additive to the v1.5 stack. The existing smart-whisper, @mintplex-labs/piper-tts-web, multer, and Express foundations remain unchanged. Three new libraries are required.
Core technologies:
- @ricky0123/vad-react ^0.0.36 (ui/): Browser-side Silero VAD via ONNX Runtime Web; delivers Float32Array at 16kHz on speech end; React 19 peer dep confirmed fixed August 2025; requires COOP/COEP headers for SharedArrayBuffer
- ffmpeg-static ^5.2.0 (server/): Ships FFmpeg 6.1.1 binaries including macOS arm64; invoked via child_process.spawn; do NOT use the archived fluent-ffmpeg (archived May 2025) or stale @ffmpeg-installer/ffmpeg (FFmpeg 4.x)
- grammy ^1.41.1 (server/): TypeScript-native Telegram bot framework (1.4M weekly downloads, higher than Telegraf); long polling for local deployment; clean file handling API via ctx.getFile(); Bot API 9.6 support confirmed
No new library is required for server-side Piper TTS (existing child_process.spawn pattern from v1.5) or audio playback (native <audio> element + Web Audio API).
Critical compatibility note: @ricky0123/vad-react requires the COOP/COEP HTTP headers (Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy) on HTML responses for SharedArrayBuffer support. Without them, VAD silently fails in Chrome and Firefox. The fix is a small addition to the Express static file middleware.
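A minimal sketch of that header addition, assuming an Express-style (req, res, next) middleware signature. The function name crossOriginIsolation and the static directory in the wiring comment are hypothetical; the two header values are the standard ones browsers require for cross-origin isolation.

```typescript
import type { IncomingMessage, ServerResponse } from "node:http";

// Sets the two headers that make a page cross-origin isolated, which is
// what browsers require before exposing SharedArrayBuffer.
function crossOriginIsolation(
  _req: IncomingMessage,
  res: ServerResponse,
  next: () => void,
): void {
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  next();
}

// Assumed wiring (the directory name is hypothetical); the middleware must be
// registered before the static handler so HTML responses carry the headers:
//   app.use(crossOriginIsolation);
//   app.use(express.static("ui/dist"));
```

Note that require-corp also affects cross-origin subresources loaded by the page, so any third-party assets would need to be checked once the headers are on.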
Expected Features
Must have (table stakes — v1.6 launch):
- Silence-based auto-submit via @ricky0123/vad-react: users expect this; manual stop feels archaic
- Waveform/amplitude visualization while recording: without it users cannot confirm mic is active
- Voice response auto-play with toggle: users expect playback to be automatic unless disabled
- Markdown-free voice responses: spoken markdown sounds broken; dual output (prose + full markdown) is the correct solution
- Telegram text relay with agent prefix: core use case for phone access; format: [AgentName]: response
- Telegram voice note transcription: mobile Telegram users default to voice notes; ignoring them immediately frustrates
Should have (differentiators, add after validation):
- Telegram TTS reply option (OGG voice note reply back) — add after text relay is validated
- Sentence-buffered TTS streaming — start playing sentence 1 while sentence 2 synthesizes; reduces perceived latency
Defer (v2+):
- Real-time speech-to-speech — requires full-duplex WebSocket audio + Pipecat/LiveKit; entirely different architecture
- Wake word detection — always-on mic; hardware device concern
- Deep Telegram web chat session sync — requires Postgres pub/sub event bus; explicitly deferred per PROJECT.md
- Per-agent Telegram bots — maintenance nightmare; single bot + agent prefix is the correct approach
Architecture Approach
The architecture is built around a single server-side voicePipelineService that both HTTP voice routes and the Telegram relay call directly, with no HTTP round-trip within the same process. The existing chatService and puterProxyService are consumed directly by the Telegram bridge as TypeScript function calls. nexus-settings.json (not DB) stores voiceMode enum and telegramToken. No DB schema changes are required.
Major components:
- voicePipelineService (server/src/services/voice-pipeline.ts): Transport-agnostic STT/TTS core; transcribe(buffer, format), synthesize(text, voiceId?), formatForVoice(text); the keystone abstraction for v1.6
- telegram service (server/src/services/telegram.ts): grammY bot lifecycle + thin relay; calls voicePipelineService and chatService directly; long polling; one persistent sessionId per Telegram chatId
- voice.ts route (server/src/routes/voice.ts): HTTP wrappers for POST /api/transcribe (moved from chat-files.ts) and new POST /api/synthesize; keeps chat-files.ts close to upstream for clean rebases
- UI voice components (VoiceMicButton, WaveformDisplay, VoiceModeToggle, useVoiceMode, useSilenceDetection): all new; enhance existing ChatInput without replacing VoiceRecordButton
- nexus-settings schema extension: adds voiceMode: "text" | "voice_input" | "full_voice" and optional telegramToken; no DB migration needed
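The service surface and settings extension listed above can be sketched as types plus a small parser. This is an illustration only: the parameter and return types, the parseVoiceSettings helper, and the choice of "text" as the fallback mode are assumptions, not taken from the codebase.

```typescript
// Assumed types for the three methods named above; only the method names
// and parameter lists come from the component description.
interface VoicePipelineService {
  transcribe(buffer: Buffer, format: string): Promise<string>;
  synthesize(text: string, voiceId?: string): Promise<Buffer>;
  formatForVoice(text: string): string;
}

type VoiceMode = "text" | "voice_input" | "full_voice";

interface NexusVoiceSettings {
  voiceMode: VoiceMode;
  telegramToken?: string;
}

const VOICE_MODES: readonly VoiceMode[] = ["text", "voice_input", "full_voice"];

// Hypothetical parser for the voice-related part of nexus-settings.json.
// Unknown or missing voiceMode values fall back to "text" (an assumption).
function parseVoiceSettings(raw: unknown): NexusVoiceSettings {
  const obj = (typeof raw === "object" && raw !== null ? raw : {}) as Record<string, unknown>;
  const voiceMode = VOICE_MODES.find((m) => m === obj.voiceMode) ?? "text";
  const token =
    typeof obj.telegramToken === "string" && obj.telegramToken.length > 0
      ? obj.telegramToken
      : undefined;
  return token === undefined ? { voiceMode } : { voiceMode, telegramToken: token };
}
```

Because the settings live in a JSON file rather than the DB, a tolerant parser like this lets older settings files load without a migration step.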
Key patterns to follow:
- Move /transcribe out of chat-files.ts into voice.ts to reduce upstream rebase conflict surface
- Use execFile (not exec) for CLI subprocess calls: prevents shell injection, matches existing codebase pattern
- Store Telegram token in nexus-settings.json, not in DB: DB migrations conflict on rebase
- Long polling (bot.start()) not webhooks: Mac Mini is behind NAT with no public HTTPS endpoint
- Wrap all CLI calls (piper, ffmpeg) in Promise.race([call, timeout(8000)]) for graceful degradation
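The execFile and timeout patterns from the list above could be combined roughly as follows. The helper names withTimeout and runCli are hypothetical, and withTimeout is a hand-rolled equivalent of the Promise.race form that also clears its timer once the call settles.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Rejects if `work` has not settled within `ms` milliseconds. Behaves like
// racing `work` against a timer, and clears the timer in both outcomes so it
// cannot keep the process alive.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Hypothetical wrapper: binaryPath is expected to be absolute. Arguments are
// passed as an array, so no shell is involved and nothing is interpolated
// into a command line.
async function runCli(binaryPath: string, args: string[], ms = 8000): Promise<string> {
  const { stdout } = await withTimeout(execFileAsync(binaryPath, args), ms);
  return stdout;
}
```

One caveat on the design: a race only stops waiting for the child, it does not stop the child. execFile also accepts a timeout option that terminates the process, which may be worth setting alongside the wrapper.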
Critical Pitfalls
- Audio format mismatch at every entry point (Pitfall 27, 28): Browser produces WebM/Opus. Telegram produces OGG/Opus 48kHz. Whisper requires WAV 16kHz mono. Always transcode via ffmpeg at every audio entry point with explicit -ar 16000 -ac 1. Make ffmpeg a hard startup dependency with absolute binary path, not PATH-resolved.
- Voice mode flag stripped in message pipeline (Pitfall 32): The voiceMode: true flag on messages must survive every pipeline layer (client → Express → message persistence → agent session codec → Hermes adapter system prompt). If stripped at any layer, the agent responds in full markdown and TTS synthesizes spoken symbols. Audit every layer before building dual output on top of it.
- Piper process-per-request anti-pattern (Pitfall 29): Spawning a new piper process per TTS request reloads the ONNX model each time (200–800ms overhead). Long responses (>400 chars) silently truncate. Sentence-chunk text before synthesis. Implement warmup call at server startup. Use absolute binary paths for service-mode deployment.
- Browser autoplay policy blocking TTS playback (Pitfall 40): audio.play() is blocked unless triggered by a user gesture. The "start voice mode" button click must unlock an AudioContext (ctx.resume()); subsequent programmatic playback via AudioBufferSourceNode works without further gestures. Developers with autoplay whitelisted in dev browsers never see this failure.
- Telegram bot event loop blocking on voice pipeline (Pitfall 37): File download + ffmpeg transcode + Whisper transcription takes 2–5 seconds. If the handler awaits all of this synchronously, Telegram resends the update and the bot processes the same voice message multiple times. Acknowledge the update immediately, process async, send intermediate "Transcribing..." status to user.
- Piper/ffmpeg not found when running as system service (Pitfall 38): spawn('piper', ...) resolves via shell PATH in interactive terminals but not in launchd/systemd service environments. Store absolute binary paths in nexus-settings config; use them explicitly in every spawn() call.
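The sentence-chunking step recommended for Pitfall 29 might look like the sketch below. The function name chunkForTts is hypothetical, and the 400-character default simply mirrors the truncation threshold quoted above; the real limit should be measured against the deployed Piper build.

```typescript
// Splits text into chunks of at most maxChars, breaking at sentence ends
// where possible and hard-splitting only when one sentence exceeds the limit.
function chunkForTts(text: string, maxChars = 400): string[] {
  // A sentence is text ending in ., ! or ?, or a trailing run with no
  // terminator at the end of the input.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let current = "";
  const flush = () => {
    if (current.trim()) chunks.push(current.trim());
    current = "";
  };
  for (const raw of sentences) {
    let sentence = raw;
    // A single over-long sentence is cut into maxChars-sized pieces.
    while (sentence.length > maxChars) {
      flush();
      chunks.push(sentence.slice(0, maxChars).trim());
      sentence = sentence.slice(maxChars);
    }
    // Start a new chunk when adding this sentence would exceed the limit.
    if ((current + sentence).length > maxChars) flush();
    current += sentence;
  }
  flush();
  return chunks;
}
```

Each returned chunk can then be synthesized as its own request, which keeps every call under the length at which truncation was observed.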
Implications for Roadmap
Based on research, the component dependency graph strongly suggests a 4-phase structure:
Phase 1: Voice Pipeline Foundation
Rationale: voicePipelineService is the keystone — every other v1.6 feature calls it. Cannot build web voice UI improvements or the Telegram bridge without it. Schema extension for voiceMode also gates downstream work. Moving /transcribe to voice.ts reduces rebase friction before any other work begins.
Delivers: nexus-settings schema with voiceMode + telegramToken; voicePipelineService with transcribe, synthesize, formatForVoice; voice.ts route with /api/transcribe (moved from chat-files.ts) and /api/synthesize; ffmpeg integration for WebM→WAV and OGG→WAV transcoding; voiceMode flag on createMessageSchema and ChatMessage shared type
Addresses: Transport-agnostic pipeline (differentiator unlocking all features), voice mode flag storage (required by all consumers), server-side synthesize endpoint (required by Telegram bridge)
Avoids: Pitfall 27 (audio format mismatch), Pitfall 32 (voice flag propagation path established before consumers built), Pitfall 38 (absolute binary paths baked in from the start), Pitfall 29 (sentence-chunked synthesis from the start)
Research flag: Standard patterns — execFile, WAV format conversion, service abstraction are well-documented. Skip /gsd:research-phase.
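As an illustration of the Phase 1 ffmpeg integration, the sketch below pipes an input buffer through ffmpeg and collects 16 kHz mono WAV from stdout. The helper names are hypothetical, and ffmpegPath is assumed to be an absolute path supplied by the caller (for example the path string that ffmpeg-static exports). The same arguments serve both the WebM/Opus and OGG/Opus inputs because ffmpeg probes the container itself.

```typescript
import { spawn } from "node:child_process";

// Builds ffmpeg arguments that read from stdin and write WAV to stdout.
function buildWhisperWavArgs(): string[] {
  return [
    "-hide_banner",
    "-loglevel", "error",
    "-i", "pipe:0",   // input (WebM/Opus or OGG/Opus) arrives on stdin
    "-ar", "16000",   // resample to 16 kHz
    "-ac", "1",       // downmix to mono
    "-f", "wav",      // force a WAV container on the output
    "pipe:1",         // write the result to stdout
  ];
}

// Runs the transcode and resolves with the WAV bytes.
function transcodeToWav(ffmpegPath: string, input: Buffer): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const proc = spawn(ffmpegPath, buildWhisperWavArgs());
    const chunks: Buffer[] = [];
    let stderr = "";
    proc.stdout.on("data", (c: Buffer) => chunks.push(c));
    proc.stderr.on("data", (c: Buffer) => { stderr += c.toString(); });
    proc.on("error", reject); // for example, binary not found at ffmpegPath
    proc.on("close", (code) => {
      if (code === 0) resolve(Buffer.concat(chunks));
      else reject(new Error(`ffmpeg exited with ${code}: ${stderr}`));
    });
    // Ignore EPIPE if ffmpeg exits before consuming all input.
    proc.stdin.on("error", () => {});
    proc.stdin.end(input);
  });
}
```

Passing the binary path in as a parameter keeps the helper independent of PATH, in line with Pitfall 38.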
Phase 2: Web Chat Voice UI
Rationale: UI improvements depend only on Phase 1 pipeline and are independent of Telegram. Establishes the voice UX foundation that users interact with directly. Validates the voice mode flag end-to-end before Telegram consumes the same flag.
Delivers: VoiceMicButton with @ricky0123/vad-react silence detection; WaveformDisplay via AnalyserNode; VoiceModeToggle three-state control; useVoiceMode and useSilenceDetection hooks; ChatMessage dual output (voice badge + expandable full markdown); TtsButton auto-play prop; COOP/COEP headers on Express static middleware
Addresses: Silence auto-submit (table stakes), waveform visualization (table stakes), auto-play toggle (table stakes), voice mode setting (table stakes), markdown-free voice responses (table stakes)
Avoids: Pitfall 31 (VAD library vs. naive RMS threshold), Pitfall 40 (AudioContext unlocked on voice mode start button), Pitfall 35 (sanitizeForTTS utility exists before first TTS integration test)
Research flag: @ricky0123/vad-react API is confirmed via docs; COOP/COEP header pattern is standard Express middleware. Skip /gsd:research-phase.
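For the Pitfall 40 mitigation in this phase, the unlock step could be as small as the sketch below. The ResumableAudioContext interface and the unlockAudio name are hypothetical; the interface only exists so the logic can be exercised outside a browser, and a real AudioContext satisfies it.

```typescript
// Minimal structural type covering the parts of AudioContext used here.
interface ResumableAudioContext {
  state: string;
  resume(): Promise<void>;
}

// Intended to be called from the click handler of the "start voice mode"
// button, so that resume() runs inside a user gesture.
// Returns true when the context is running afterwards.
async function unlockAudio(ctx: ResumableAudioContext): Promise<boolean> {
  if (ctx.state === "suspended") {
    await ctx.resume();
  }
  return ctx.state === "running";
}

// Assumed browser wiring (element and variable names are hypothetical):
//   const ctx = new AudioContext();
//   startButton.addEventListener("click", () => void unlockAudio(ctx));
```

Keeping one shared context for the whole voice session means later TTS playback can reuse it without another gesture.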
Phase 3: Telegram Bridge
Rationale: Telegram bridge is a pure consumer of Phase 1's voicePipelineService and the existing chatService. No web UI changes needed. Must follow Phase 1 but is independent of Phase 2.
Delivers: telegramService with grammY long polling; text relay to chatService; voice note relay (OGG download → ffmpeg transcode → transcribe → agent → text reply); persistent chatId → sessionId mapping; agent prefix on replies; POST /api/telegram/token and GET /api/telegram/status management routes
Addresses: Telegram text relay (table stakes), Telegram voice note relay (table stakes), agent identity visible in Telegram replies (table stakes)
Avoids: Pitfall 28 (OGG 48kHz → WAV 16kHz explicit transcode, not assumed), Pitfall 33 (persistent session per chatId, not per message), Pitfall 34 (long polling; delete any existing webhook first), Pitfall 37 (async pipeline; acknowledge immediately; send "Transcribing..." status)
Research flag: Needs /gsd:research-phase for grammY session management (persistent chatId → sessionId mapping approach vs. grammY conversation plugin) and async update acknowledgement pattern before implementation.
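To make the first of the two session-management candidates concrete, here is a sketch of the in-memory mapping together with the agent-prefixed reply format. The names sessionFor and formatReply are hypothetical, and using a random UUID as the sessionId is an assumption. The map does not survive a restart, which is one of the trade-offs the research phase would need to weigh.

```typescript
import { randomUUID } from "node:crypto";

// chatId -> sessionId, held in process memory only.
const sessionsByChat = new Map<number, string>();

// Returns the existing session for a Telegram chat, or creates one on the
// first message so every later message lands in the same conversation.
function sessionFor(chatId: number): string {
  let id = sessionsByChat.get(chatId);
  if (id === undefined) {
    id = randomUUID();
    sessionsByChat.set(chatId, id);
  }
  return id;
}

// Formats a reply in the "[AgentName]: response" shape used for the relay.
function formatReply(agentName: string, response: string): string {
  return `[${agentName}]: ${response}`;
}
```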
Phase 4: Polish and Post-Launch Additions
Rationale: After core voice and Telegram are validated, add differentiator features that require voice pipeline stability. These are explicitly post-validation based on user feedback triggers.
Delivers: Telegram TTS reply (synthesize OGG voice note reply); sentence-buffered TTS streaming; Piper persistent warmup optimization; voice response history in chat UI
Addresses: Sentence-buffered TTS (differentiator), Telegram TTS reply (differentiator)
Avoids: Pitfall 39 (dual output via single LLM call, not two calls), Pitfall 29 (persistent Piper process architecture)
Research flag: Flag for /gsd:research-phase on Piper persistent HTTP wrapper — community piper-http package status is unconfirmed; verify before committing to this approach.
Phase Ordering Rationale
- voicePipelineService (Phase 1) strictly precedes both Phase 2 and Phase 3: this is the hardest dependency in the v1.6 graph
- Phase 2 and Phase 3 are independent of each other and can run in parallel for two-developer teams; sequential ordering here assumes single-developer delivery
- voiceMode schema change (Phase 1) must precede ChatMessage dual output (Phase 2): shared package change gates UI work
- Moving /transcribe from chat-files.ts to voice.ts in Phase 1 reduces rebase conflict surface before any other work begins
- Phase 4 is explicitly post-validation: only add Telegram TTS reply and sentence-buffered streaming after confirming the basic pipeline is stable in real use
Confidence Assessment
| Area | Confidence | Notes |
|---|---|---|
| Stack | MEDIUM-HIGH | grammy HIGH (official docs, Bot API 9.6 verified); ffmpeg-static MEDIUM (arm64 confirmed, pipe approach verified); vad-react MEDIUM (React 19 fix confirmed via GitHub issue; ONNX WASM SharedArrayBuffer behavior requires COOP/COEP header testing) |
| Features | MEDIUM-HIGH | STT/TTS pipeline patterns well-documented; dual output prompt engineering reliability is MEDIUM — smaller 7B models produce malformed structured output ~10% of the time; Approach B fallback (post-processing strip) must be implemented |
| Architecture | HIGH | Based on direct codebase inspection of actual source files; service boundary and data flow verified; no speculative assumptions |
| Pitfalls | HIGH | Based on direct codebase analysis plus targeted research on each integration domain; v1.6 pitfalls 27–40 are specific, sourced, and actionable |
Overall confidence: MEDIUM-HIGH
Gaps to Address
- grammY session management approach: Lightweight in-memory Map<chatId, sessionId> vs. grammY conversation plugin; not evaluated. Validate during Phase 3 research-phase before implementation.
- Dual output prompt reliability on 7B models: Works reliably on larger models; ~90% on 7B tier. Approach B fallback (post-processing strip) must be implemented as a safety net, not treated as optional. Design both before Phase 1 ships.
- Piper persistent process viability: Sentence-chunked per-request synthesis avoids the worst of the reload latency, but a persistent Piper HTTP wrapper would be cleaner long-term. Community piper-http status unconfirmed. Flag for Phase 4 research-phase.
- smart-whisper OGG support: Whether smart-whisper can ingest OGG directly (avoiding ffmpeg for the Telegram path) or always requires WAV was not confirmed. Verify at Phase 1 start: if OGG is accepted natively, the Telegram transcription path can skip one transcode step.
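The Approach B fallback mentioned in the gaps above (a post-processing strip) could start from something like the following. This is a best-effort sketch under stated assumptions: the function name stripMarkdownForSpeech is hypothetical, the regexes cover only common Markdown constructs, and fenced code blocks are dropped entirely on the assumption that code should not be read aloud.

```typescript
// Best-effort removal of common Markdown syntax so TTS does not read
// symbols aloud. Not a full Markdown parser.
function stripMarkdownForSpeech(md: string): string {
  return md
    .replace(/`{3}[\s\S]*?`{3}/g, " ")                  // drop fenced code blocks
    .replace(/`([^`]*)`/g, "$1")                        // inline code -> plain text
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, "$1")           // images -> alt text
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1")            // links -> link text
    .replace(/^[ \t]{0,3}#{1,6}[ \t]+/gm, "")           // heading markers
    .replace(/^[ \t]{0,3}>[ \t]?/gm, "")                // blockquote markers
    .replace(/^[ \t]*(?:[-*+]|\d+\.)[ \t]+/gm, "")      // list markers
    .replace(/\*\*(.+?)\*\*/g, "$1")                    // **bold**
    .replace(/\*(.+?)\*/g, "$1")                        // *italic*
    .replace(/(?<!\w)__(.+?)__(?!\w)/g, "$1")           // __bold__
    .replace(/(?<!\w)_(.+?)_(?!\w)/g, "$1")             // _italic_, not snake_case
    .replace(/[ \t]+/g, " ")                            // collapse runs of spaces
    .trim();
}
```

Running this on every voice-mode response, even when the model appears to have followed the prose-only instruction, keeps the behavior the same whether or not the structured output was well formed.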
Sources
Primary (HIGH confidence)
- grammY official docs — TypeScript support, long polling, file handling, Bot API 9.6 support
- grammY deployment types guide — long polling vs. webhooks recommendation for local deployment
- ffmpeg-static GitHub — macOS arm64 binary confirmed, FFmpeg 6.1.1, pipe-based invocation pattern
- Telegram Bot API sendVoice — OGG Opus format requirement, 48kHz mono wire format
- Direct codebase inspection: server/src/routes/chat-files.ts, chat.ts, services/nexus-settings.ts, app.ts, ui/src/components/VoiceRecordButton.tsx, TtsButton.tsx, hooks/usePiperTts.ts, packages/shared/src/validators/chat.ts, packages/shared/src/types/chat.ts
- .planning/STATE.md: v1.6 architectural decisions (transport-agnostic, disposable bridge, dual output, per-message flag)
Secondary (MEDIUM confidence)
- @ricky0123/vad-react npm — v0.0.36, React 19 fix confirmed
- vad React 19 support issue #188 — React 19 peer dep fix confirmed August 2025
- vad API docs: onSpeechEnd Float32Array 16kHz output confirmed
- fluent-ffmpeg archival: archived May 22 2025, confirmed
- Real-Time vs Turn-Based STT/TTS Voice Agent Architecture (softcery.com)
- The Voice AI Stack for Building Agents (assemblyai.com)
- Telegram speech-to-text bot with Node.js (loonskai.com)
- grammY file handling guide: ctx.getFile(), download pattern
Tertiary (LOW confidence — inferred from patterns)
- Dual output prompt reliability on 7B models — inferred from structured output community reports; not benchmarked on Hermes specifically
- Piper persistent HTTP wrapper: community pattern referenced; piper-http package status not verified
- sanitizeForTTS utility pattern: inferred from TTS pipeline implementations; implementation detail not sourced from a canonical reference
Research completed: 2026-04-03
Ready for roadmap: yes