Feature Research
Domain: Voice Pipeline (Whisper STT + Piper TTS) + Telegram Bridge (Nexus v1.6)
Researched: 2026-04-03
Confidence: MEDIUM-HIGH. STT/TTS pipeline patterns are well documented and the Telegram Bot API is stable; dual-output formatting and voice mode UX patterns are inferred from ChatGPT/Meta AI voice implementations and community patterns.
Milestone Scope
This document covers only the NEW features in v1.6. The following are already built and are dependencies, not deliverables:
- VoiceRecordButton with MediaRecorder API in ChatInput (v1.3)
- TtsButton with @mintplex-labs/piper-tts-web WASM synthesis (v1.3/v1.5)
- POST /transcribe endpoint with whisper-cpp/openai-whisper cascade (v1.3)
- VoiceStep in onboarding wizard (v1.5)
- voiceEnabled in nexus-settings (v1.5)
- Full chat system with streaming SSE (v1.3)
New features being researched:
- Transport-agnostic voice pipeline (server-side, not just browser WASM)
- Voice mode flag on messages (affects response formatting)
- Dual output pattern: voice-optimized prose + full markdown text
- Web chat voice UI improvements: silence detection, waveform, auto-submit
- Web chat audio playback: inline player, auto-play toggle
- Voice mode toggle setting (text only / voice input / full voice)
- Minimal Telegram bridge: single bot, text + voice relay, agent prefixing
Feature Landscape
Table Stakes (Users Expect These)
Features users assume exist when voice or Telegram is mentioned. Missing these makes the feature feel broken or incomplete.
| Feature | Why Expected | Complexity | Notes |
|---|---|---|---|
| Silence-based auto-submit | Every voice input UI (Siri, Google, Whisper demos) stops recording on silence; holding a button feels archaic | MEDIUM | WebRTC VAD or AudioWorklet amplitude monitoring; 1.5s silence threshold typical; must show countdown so user knows what's happening |
| Waveform/amplitude visualization while recording | Users expect visual feedback that the mic is active; a static "recording..." text feels broken | LOW | Canvas or SVG with 30-50 data points; AnalyserNode from Web Audio API; real-time amplitude bars, not pre-rendered waveform |
| Voice response auto-play toggle | If the AI responded with audio, playing it automatically is expected unless the user disabled it; manual play-only feels incomplete | LOW | Boolean setting in nexus-settings (voiceAutoPlay); inline HTML5 <audio> element is sufficient; Web Audio API not needed |
| Markdown-free voice responses | Users who hear responses read aloud expect prose sentences, not "asterisk asterisk bold asterisk asterisk code block triple backtick" spoken aloud | MEDIUM | Requires voice mode flag on the message sent to LLM; system prompt addendum: "respond in natural spoken prose, no markdown symbols, no bullet points, no code blocks unless the user explicitly asks"; dual output requires separate LLM pass or post-processing strip |
| Telegram text relay to existing chat | Sending a text message to the Telegram bot and receiving the agent's reply is the core use case; anything less is not a bridge | MEDIUM | Telegraf (Node.js) as bot framework; message forwarded to existing chat API endpoint; response prefixed with agent name |
| Telegram voice message transcription | Telegram users frequently send voice notes; a bridge that ignores voice messages frustrates mobile users immediately | MEDIUM | Telegram sends voice as OGG/Opus; download → convert (ffmpeg) → POST /transcribe → forward text to agent → reply with text (+ optionally TTS audio back) |
| Agent identity visible in Telegram replies | When multiple agents can respond, the user must know who is replying | LOW | Simple text prefix: [Hermes] Your answer here; consistent format across all messages |
| Recording state visible in UI | Users must be able to tell when recording is active vs. idle vs. processing | LOW | Three states in mic button: idle (mic icon), recording (red pulsing), processing (spinner); state machine pattern |
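The silence-based auto-submit row can be illustrated with a small amplitude detector. This is a minimal sketch, not the existing VoiceRecordButton logic: `SilenceDetector`, `rmsOf`, and the 0..1 amplitude scale are hypothetical names and assumptions, and the samples are assumed to come from something like `AnalyserNode.getFloatTimeDomainData()` in the browser.

```typescript
// Hypothetical helper: decides when to auto-submit based on amplitude frames.
type SilenceConfig = {
  silenceMs: number; // how long the input must stay quiet, e.g. 1500
  threshold: number; // RMS level below which a frame counts as silent (assumed 0..1 scale)
};

/** Root-mean-square amplitude of one frame of time-domain samples. */
function rmsOf(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  return Math.sqrt(sumSquares / samples.length);
}

class SilenceDetector {
  private readonly config: SilenceConfig;
  private quietSince: number | null = null;
  private heardSpeech = false;

  constructor(config: SilenceConfig) {
    this.config = config;
  }

  /** Feed one amplitude reading; returns true once auto-submit should fire. */
  push(rms: number, nowMs: number): boolean {
    if (rms >= this.config.threshold) {
      this.heardSpeech = true;
      this.quietSince = null; // speech resets the silence timer
      return false;
    }
    if (!this.heardSpeech) return false; // ignore silence before the user speaks
    if (this.quietSince === null) this.quietSince = nowMs;
    return nowMs - this.quietSince >= this.config.silenceMs;
  }

  /** Milliseconds left before auto-submit, usable for the countdown indicator. */
  remainingMs(nowMs: number): number {
    if (this.quietSince === null) return this.config.silenceMs;
    return Math.max(0, this.config.silenceMs - (nowMs - this.quietSince));
  }
}
```

Ignoring leading silence is a design choice in this sketch: it keeps the recorder from submitting an empty clip when the user taps the mic and pauses before speaking.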
Differentiators (Competitive Advantage)
Features that make v1.6's voice and Telegram features worth using, beyond baseline functionality.
| Feature | Value Proposition | Complexity | Notes |
|---|---|---|---|
| Transport-agnostic voice pipeline | Voice processing works identically for browser input, Telegram voice notes, and future CLI/API callers; no duplication of Whisper/Piper logic | MEDIUM | Abstract to a VoicePipelineService: transcribe(audioBuffer) → text, synthesize(text, voice?) → audioBuffer; HTTP endpoints call the service; Telegram bot calls the same service |
| Dual output pattern | AI responds with two representations: short spoken-prose version (for TTS/Telegram) and full markdown version (for web chat, copy-paste, code); user sees both where appropriate | HIGH | Prompt engineering: "Provide a SPOKEN response (1-3 sentences, no markdown) and a DETAILED response (full markdown). Format: SPOKEN: … DETAILED: …"; parse and split in middleware; store both in message metadata |
| Sentence-buffered TTS streaming | Start playing the first sentence while the second is still synthesizing; reduces perceived latency vs. waiting for full response | MEDIUM | Split response on .!?; Piper synthesizes sentence 1, audio starts playing; meanwhile sentence 2 begins synthesis; append chunks to audio queue |
| Voice mode flag preserves context | Messages tagged with voice_mode: true in the DB let the UI, Telegram bridge, and future Command Center all render correctly without re-inferring intent | LOW | Add source field or voice_mode boolean to message metadata; already-existing message schema likely supports metadata/extras column |
| Telegram as thin relay (not a separate chat product) | The Telegram bot forwards to the existing Nexus chat engine; responses use the full agent intelligence already configured; no separate bot personality to maintain | LOW | Relay pattern: Telegram message → POST /api/workspaces/:id/chat/messages → SSE stream → collect full response → reply to Telegram; agent prefixing is presentation only |
| Language auto-detection in STT | Whisper natively detects language without configuration; relay this info back to the UI so the user knows what language was detected | LOW | Whisper returns language in its JSON output; pass through to transcript response; log in message metadata; no user config needed for common languages |
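The sentence-buffered TTS row depends on cutting a streamed response into sentences as it arrives. The sketch below shows one way to do that; `SentenceBuffer` is a hypothetical name, and the rule that a sentence ends at `.`, `!`, or `?` followed by whitespace is an assumption that keeps decimals such as 3.14 intact but still mis-splits abbreviations like "e.g. this".

```typescript
// Hypothetical buffer: accepts streamed text chunks, emits complete sentences.
class SentenceBuffer {
  private pending = "";

  /** Append a chunk and return every sentence that is now complete. */
  push(chunk: string): string[] {
    this.pending += chunk;
    const ready: string[] = [];
    const boundary = /[.!?]+\s+/g; // terminator followed by whitespace
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      ready.push(this.pending.slice(start, end).trim());
      start = end;
    }
    this.pending = this.pending.slice(start); // keep the unfinished tail
    return ready;
  }

  /** Call when the stream ends to release whatever text is left. */
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest.length > 0 ? [rest] : [];
  }
}
```

Each emitted sentence could then be handed to the synthesizer while later chunks are still arriving, with the resulting audio appended to a playback queue.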
Anti-Features (Commonly Requested, Often Problematic)
| Feature | Why Requested | Why Problematic | Alternative |
|---|---|---|---|
| Real-time speech-to-speech streaming | Feels like a "next level" voice experience | Requires full-duplex WebSocket audio, interrupt handling, turn-taking logic, VAD on both ends — an entirely different architecture (Pipecat, LiveKit); out of scope for a relay bridge | Sequential pipeline (speak → wait → hear) is sufficient for assistant use cases; real-time is only needed for phone-call-style interaction |
| Per-agent Telegram bots | "My PM agent should have its own bot handle" | Multiple bots means multiple bot tokens, multiple webhook registrations, complex routing when agents hand off to each other; maintenance nightmare | Single bot with agent name prefix in messages: [PM] Here is your sprint plan; PROJECT.md explicitly out-of-scopes this |
| Deep Telegram ↔ web chat sync | "I want to see Telegram messages in the web UI" | Real-time bidirectional sync requires a shared event bus (Postgres LISTEN/NOTIFY or Redis pub/sub), session management across transports, and conflict resolution; PROJECT.md explicitly defers this to "Postgres bus" future milestone | Relay is one-way per session: Telegram message → agent → Telegram reply; web chat is a separate session |
| Wake word detection | "Hey Nexus, start recording" | Requires always-on microphone access, local wakeword model (Porcupine, OpenWakeWord), and careful battery/privacy handling; browser does not allow always-on mic | Mic button tap is sufficient; wake word is a future hardware device concern |
| Streaming TTS word-by-word | Feels maximally responsive | Browser audio playback of a stream of tiny WAV fragments causes clicks, gaps, and buffering issues; each Piper call has startup overhead; the sentence-buffered approach gives 95% of the benefit | Sentence-buffered playback (buffer on .!?); start playing sentence 1 while sentence 2 synthesizes |
| Inline code execution over Telegram | "I want to run tasks from Telegram" | Security: arbitrary code execution via an unauthenticated chat interface; scope: Telegram bridge is explicitly a thin relay, not a command interface | Support text and voice message relay only; task creation via conversational agent response is sufficient |
| GSD formatting / rich elements in Telegram | Telegram supports inline keyboards, threaded replies — use them | Telegram's formatting model (inline keyboards, callback queries) requires stateful session tracking; PROJECT.md explicitly out-of-scopes this | Plain text + Markdown v1 (which Telegram natively renders for bold/italic/code); no inline keyboards in v1.6 |
| Transcription editing before sending | "Let me see the transcript before it goes to the agent" | Adds a confirmation step that breaks the hands-free voice flow; most users trust auto-send after VAD silence detection; optionally show transcript as a message in the UI after the fact | Show the detected transcript in the chat message bubble with a small "mic" icon; no edit step |
Feature Dependencies
Transport-Agnostic VoicePipelineService
└──wraps──> Existing /transcribe endpoint (Whisper) [already built]
└──wraps──> Piper TTS binary/WASM [already built in browser; server-side is new]
└──consumed-by──> Web chat mic button (browser calls server or uses WASM directly)
└──consumed-by──> Telegram bridge (server-side calls VoicePipelineService)
└──consumed-by──> Future transports (CLI, API, Command Center)
Voice Mode Flag
└──set-by──> Web chat (user is in voice mode)
└──set-by──> Telegram bridge (message arrived as voice note)
└──consumed-by──> LLM prompt construction (appends no-markdown instruction)
└──consumed-by──> Dual output pattern (triggers two-response format)
└──consumed-by──> TTS synthesis (triggers auto-synthesis of response)
Dual Output Pattern
└──requires──> Voice mode flag (only triggers in voice mode)
└──requires──> LLM prompt engineering (structured SPOKEN/DETAILED format)
└──produces──> Short prose (for TTS, Telegram reply)
└──produces──> Full markdown (for web chat display, copy)
Web Chat Voice UI (silence detection + waveform)
└──requires──> Existing VoiceRecordButton [already built — enhance, not replace]
└──requires──> Web Audio API (AnalyserNode for amplitude) [browser built-in]
└──enhances──> Voice Mode Toggle (waveform only visible when voice mode active)
Web Chat Audio Playback
└──requires──> TTS synthesis output (WAV/MP3 audio buffer)
└──requires──> Voice mode flag (auto-play only in full voice mode)
└──independent──> waveform visualization (different UI component)
Telegram Bridge
└──requires──> VoicePipelineService (for voice note handling)
└──requires──> Existing chat API (POST /api/... for message relay)
└──requires──> ffmpeg (OGG/Opus → WAV conversion for Whisper)
└──requires──> Telegraf (Node.js bot framework)
└──independent──> web chat UI changes
Onboarding STT/TTS Detection
└──requires──> Existing VoiceStep [already built — update, not replace]
└──requires──> VoicePipelineService availability check
└──independent──> Telegram bridge
Dependency Notes
- VoicePipelineService is the keystone: Build this first. It abstracts Whisper + Piper behind a clean interface. Every other v1.6 feature is a consumer. If this is skipped, the Telegram bridge and web improvements become duplicate, divergent code.
- Voice mode flag must be stored on the message: Not just passed in memory. Future Command Center and Telegram both need to know retroactively whether a message was voice-originated.
- Dual output is optional on non-voice messages: Text-mode messages do not need the SPOKEN variant. The prompt injection and response parsing only apply when voice_mode: true.
- Telegram bridge has no UI: It's a server-side Node.js process (or Express route). No React changes needed for Telegram.
- ffmpeg is a hard dependency for Telegram voice notes: Telegram sends OGG/Opus; Whisper expects WAV/MP3. ffmpeg must be available on the server. On Mac Mini this is brew install ffmpeg.
- Web chat waveform enhances existing VoiceRecordButton: Do not replace it. The existing component handles MediaRecorder and send; add AudioWorklet/AnalyserNode visualization on top.
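To make the "keystone" note concrete, here is a minimal sketch of the service boundary. Only the two method shapes (transcribe and synthesize) come from this document; `SttBackend`, `TtsBackend`, the `Transcript` type, and the constructor injection are assumptions chosen so that Whisper and Piper can be swapped or faked in tests.

```typescript
// Illustrative types: the source fixes only the transcribe/synthesize shapes.
type Transcript = { text: string; language?: string };

interface SttBackend {
  transcribe(audio: Uint8Array): Promise<Transcript>;
}

interface TtsBackend {
  synthesize(text: string, voice?: string): Promise<Uint8Array>;
}

class VoicePipelineService {
  private readonly stt: SttBackend;
  private readonly tts: TtsBackend;

  constructor(stt: SttBackend, tts: TtsBackend) {
    this.stt = stt;
    this.tts = tts;
  }

  /** Audio in, text out. Every transport calls this same method. */
  async transcribe(audio: Uint8Array): Promise<Transcript> {
    if (audio.length === 0) throw new Error("empty audio buffer");
    return this.stt.transcribe(audio);
  }

  /** Text in, audio out. Returns an empty buffer for blank text. */
  async synthesize(text: string, voice?: string): Promise<Uint8Array> {
    const trimmed = text.trim();
    if (trimmed.length === 0) return new Uint8Array(0);
    return this.tts.synthesize(trimmed, voice);
  }
}
```

With this shape, the HTTP endpoints and the Telegram bridge would both hold a reference to one `VoicePipelineService` instance rather than calling Whisper or Piper directly.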
MVP Definition
Launch With (v1.6 Milestone)
Minimum viable set to make voice and Telegram genuinely useful, not just technically present.
- VoicePipelineService — Transport-agnostic server-side Whisper + Piper abstraction. Why essential: gates all other features; prevents code duplication between web and Telegram.
- Voice mode flag + dual output — LLM receives no-markdown instruction; response splits into spoken prose + full markdown. Why essential: spoken markdown sounds broken; this is what makes TTS usable.
- Web chat silence detection + auto-submit — Amplitude-based VAD stops recording automatically and submits. Why essential: hands-free voice only works if the user does not have to click "send."
- Web chat waveform visualization — Amplitude bars while recording. Why essential: without it, users cannot tell if the mic is picking up audio.
- Web chat audio playback with auto-play toggle — Agent voice responses play inline. Why essential: without playback, TTS synthesis has nowhere to go.
- Voice mode toggle setting — Three modes: text only / voice input only / full voice (input + output). Why essential: users need to control the modality per session.
- Telegram text relay — Text messages in → agent response out, with agent prefix. Why essential: core use case for phone access.
- Telegram voice note relay — Voice notes in → transcribe → agent → text reply. Why essential: mobile Telegram users default to voice notes.
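The three-way voice mode toggle in the list above can be modeled as a small union type that the UI reads when deciding what to show. This is a sketch under assumptions: the mode names `"text"`, `"voice-input"`, and `"full-voice"` and the `VoiceCapabilities` shape are hypothetical; only voiceAutoPlay appears elsewhere in this document.

```typescript
// Hypothetical setting values for the three modes described in the MVP list.
type VoiceMode = "text" | "voice-input" | "full-voice";

type VoiceCapabilities = {
  showMicButton: boolean; // voice input available in the chat box
  synthesizeReplies: boolean; // agent replies are sent through TTS
};

function capabilitiesFor(mode: VoiceMode): VoiceCapabilities {
  switch (mode) {
    case "text":
      return { showMicButton: false, synthesizeReplies: false };
    case "voice-input":
      return { showMicButton: true, synthesizeReplies: false };
    case "full-voice":
      return { showMicButton: true, synthesizeReplies: true };
  }
}

/** Auto-play only applies in full voice mode and only if the user left it on. */
function shouldAutoPlay(mode: VoiceMode, voiceAutoPlay: boolean): boolean {
  return capabilitiesFor(mode).synthesizeReplies && voiceAutoPlay;
}
```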
Add After Validation (v1.6.x)
- Telegram TTS reply option — Agent response synthesized and sent back as an OGG voice note. Trigger: user feedback that text replies are too long to read on phone.
- Sentence-buffered TTS streaming — Start audio playback before full synthesis completes. Trigger: latency complaints with longer responses.
- Voice response history in UI — Chat messages show audio player for past synthesized responses (not just the current one). Trigger: users want to replay previous responses.
Future Consideration (v2+)
- Real-time speech-to-speech — Full-duplex conversation; requires Pipecat or LiveKit; entirely different architecture.
- Wake word detection — Always-on mic, local wakeword model; hardware device concern.
- Deep Telegram ↔ web sync — Bidirectional session mirroring via Postgres bus; deferred per PROJECT.md.
- Per-transport voice models — Different Piper voice for Telegram vs. web (e.g., cleaner phone voice vs. natural assistant voice).
Feature Prioritization Matrix
| Feature | User Value | Implementation Cost | Priority |
|---|---|---|---|
| VoicePipelineService | HIGH | MEDIUM | P1 |
| Voice mode flag + dual output | HIGH | MEDIUM | P1 |
| Silence detection + auto-submit | HIGH | MEDIUM | P1 |
| Waveform visualization | MEDIUM | LOW | P1 |
| Audio playback + auto-play toggle | HIGH | LOW | P1 |
| Voice mode toggle setting | HIGH | LOW | P1 |
| Telegram text relay | HIGH | MEDIUM | P1 |
| Telegram voice note relay | HIGH | MEDIUM | P1 |
| Telegram TTS reply | MEDIUM | MEDIUM | P2 |
| Sentence-buffered TTS streaming | MEDIUM | MEDIUM | P2 |
| Voice response history | LOW | MEDIUM | P3 |
| Real-time speech-to-speech | HIGH | HIGH | P3 (v2+) |
Priority key:
- P1: Must have for v1.6 launch
- P2: Should have, add in v1.6.x
- P3: Nice to have, v2+
Competitor Feature Analysis
| Feature | ChatGPT Voice Mode | Telegram + other bots | Nexus v1.6 Approach |
|---|---|---|---|
| STT | Whisper (cloud) | Per-bot (usually cloud) | Whisper local, CPU fallback |
| TTS | Custom neural (cloud) | gTTS or ElevenLabs | Piper local, CPU-only |
| Markdown-free voice | Yes (GPT strips markdown) | Usually not (bots send raw markdown) | Dual output: SPOKEN + DETAILED |
| Silence detection | Yes (VAD, full-duplex) | N/A | Amplitude VAD, 1.5s threshold |
| Waveform UI | Animated blobs (not literal waveform) | N/A | AnalyserNode amplitude bars |
| Agent identity in replies | N/A (single assistant) | Custom per bot | Text prefix [AgentName] |
| Telegram voice note support | N/A | Varies widely | OGG→WAV→Whisper→agent |
| Offline / local operation | No | No | Fully local: Whisper + Piper + Ollama |
| Transport abstraction | N/A | N/A | VoicePipelineService (web + Telegram share same service) |
Voice Pipeline Architecture Notes
Confidence: HIGH for the cascading/sequential pipeline; MEDIUM for dual output prompt engineering reliability.
Sequential Pipeline (chosen architecture for v1.6)
[Browser/Telegram]
|
| audio buffer (WAV/OGG)
v
VoicePipelineService.transcribe()
|
| transcript text + language + confidence
v
LLM (with voice_mode prompt addendum)
|
| structured response: SPOKEN: "..." DETAILED: "..."
v
Response parser → { spoken: string, detailed: string }
|                    |
|                    v
|                    Web chat: render detailed (markdown)
|                    Telegram: send spoken as text
v
VoicePipelineService.synthesize(spoken)
|
| WAV audio buffer
v
Web chat: <audio> element autoplay
Telegram (v1.6.x): sendVoice() as OGG/Opus
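The fan-out at the bottom of the diagram (which variant each transport displays, and which text goes to synthesis) can be written as one pure function. This is an illustrative sketch; `planDelivery`, `Transport`, and the `Delivery` shape are hypothetical names, not part of the existing codebase.

```typescript
type Transport = "web" | "telegram";
type DualReply = { spoken: string; detailed: string };
type Delivery = {
  displayText: string; // what the user reads
  ttsText: string | null; // what gets synthesized, if anything
};

function planDelivery(transport: Transport, reply: DualReply): Delivery {
  if (transport === "telegram") {
    // Telegram receives the short prose as a text message; no audio in v1.6.
    return { displayText: reply.spoken, ttsText: null };
  }
  // Web chat renders the full markdown and speaks the short prose.
  return { displayText: reply.detailed, ttsText: reply.spoken };
}
```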
Why not real-time speech-to-speech:
Real-time requires full-duplex WebSocket audio, interrupt detection (barge-in), turn-taking state machine, and sub-200ms latency budgets. The sequential pattern targets <3s end-to-end on Apple Silicon M4, which is appropriate for assistant interactions (not phone calls). The complexity delta is enormous; PROJECT.md explicitly defers this.
Telegram Bridge Architecture Notes
Confidence: HIGH — Telegraf is the standard Node.js Telegram framework; patterns are well-established.
Single Bot, Agent Prefix Pattern
Telegram user sends: "What's the status of the Nexus project?"
|
Telegraf handler
|
POST /api/workspaces/:id/chat/messages
{ content: "What's the status...", source: "telegram", voice_mode: false }
|
SSE stream → collect until [DONE]
|
bot.telegram.sendMessage(chatId, "[Hermes] The Nexus project is currently...")
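The "collect until [DONE]" step and the agent prefix can be sketched as two pure functions. The wire format here is an assumption: each event is taken to be a `data:` line carrying JSON with a hypothetical `delta` field, terminated by `data: [DONE]`. The real chat endpoint may use a different payload shape.

```typescript
/** Concatenate streamed text from a raw SSE body, stopping at [DONE]. */
function collectSse(raw: string): string {
  let reply = "";
  for (const line of raw.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue; // skip blank lines, comments, event names
    const payload = line.slice(5).trim();
    if (payload === "[DONE]") break;
    // Assumed payload shape: {"delta": "..."}; adjust to the real endpoint.
    const parsed = JSON.parse(payload) as { delta?: string };
    reply += parsed.delta ?? "";
  }
  return reply;
}

/** Presentation-only prefix so the user can tell which agent replied. */
function withAgentPrefix(agent: string, text: string): string {
  return `[${agent}] ${text.trim()}`;
}
```

A Telegraf text handler could then pass `withAgentPrefix("Hermes", collectSse(body))` to `ctx.reply()`.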
Voice Note Flow
Telegram user sends voice note (OGG/Opus, ~15s)
|
Telegraf voice handler: bot.telegram.getFileLink(fileId) → download OGG
|
ffmpeg: OGG → WAV (16kHz mono)
|
VoicePipelineService.transcribe(wavBuffer)
|
POST /api/workspaces/:id/chat/messages
{ content: transcript, source: "telegram", voice_mode: true }
|
Collect SSE stream → spoken variant of response
|
bot.telegram.sendMessage(chatId, "[Hermes] " + spokenResponse)
// v1.6.x: bot.telegram.sendVoice(chatId, { source: synthesizedOggBuffer })
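The ffmpeg step in the flow above comes down to a fixed argument list. The sketch below builds it; `oggToWavArgs` is a hypothetical helper name, and the 16-bit PCM codec choice is an assumption (the flow only specifies 16 kHz mono WAV).

```typescript
/** Build ffmpeg arguments that convert a Telegram voice note to 16 kHz mono WAV. */
function oggToWavArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y", // overwrite the output file if it already exists
    "-i", inputPath, // Telegram voice note (OGG/Opus)
    "-ar", "16000", // resample to 16 kHz
    "-ac", "1", // downmix to mono
    "-c:a", "pcm_s16le", // 16-bit PCM samples (assumed)
    outputPath,
  ];
}
```

Passing the array to Node's `child_process.execFile("ffmpeg", args)` avoids shell quoting problems with file paths.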
Key implementation decisions:
- Polling vs. webhooks: Webhooks require a public HTTPS endpoint. For Mac Mini on home network, long polling is the correct choice. Telegraf supports both; use bot.launch() (polling mode) for v1.6.
- Bot token storage: Environment variable TELEGRAM_BOT_TOKEN; added to .env and loaded via existing env config pattern.
- Authorized users only: Store allowed Telegram user IDs or usernames in nexus-settings to prevent unauthorized access; a bridge with no auth is a security hole.
- Conversation context: Each Telegram chat ID maps to a Nexus workspace session; maintain a telegramChatId → workspaceId + conversationId mapping in a lightweight in-memory store or SQLite table.
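The authorization and conversation-context decisions can share one small in-memory store. This is a sketch of the in-memory option only; `TelegramSessionStore` and `ChatBinding` are hypothetical names, and it checks numeric user IDs rather than usernames, on the assumption that IDs are the more stable key.

```typescript
type ChatBinding = { workspaceId: string; conversationId: string };

class TelegramSessionStore {
  private readonly allowedUserIds: Set<number>;
  private readonly bindings = new Map<number, ChatBinding>();

  constructor(allowedUserIds: number[]) {
    this.allowedUserIds = new Set(allowedUserIds);
  }

  /** Reject messages from anyone not on the allow-list. */
  isAuthorized(userId: number): boolean {
    return this.allowedUserIds.has(userId);
  }

  /** Remember which workspace conversation a Telegram chat maps to. */
  bind(chatId: number, binding: ChatBinding): void {
    this.bindings.set(chatId, binding);
  }

  /** Returns undefined for a chat that has not been bound yet. */
  lookup(chatId: number): ChatBinding | undefined {
    return this.bindings.get(chatId);
  }
}
```

An in-memory map loses its bindings on restart, which is the trade-off against the SQLite option mentioned above.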
Voice Mode Response Formatting Notes
Confidence: MEDIUM — dual output prompt pattern is used in production systems but prompt reliability varies by model; post-processing strip is more reliable.
Two approaches; try Approach A first and fall back to Approach B when the model ignores the format:
Approach A: Prompt-based dual output (preferred)
Append to system prompt when voice_mode: true:
When responding, provide two versions:
SPOKEN: [1-3 sentences in natural spoken prose, no markdown, no symbols, no lists]
DETAILED: [Full response with markdown formatting, code blocks, bullet points as needed]
Parse response: split on SPOKEN: and DETAILED: markers.
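A minimal sketch of the marker-based split, assuming the model emits SPOKEN: before DETAILED: exactly as the prompt requests. `parseDualOutput` is a hypothetical name; it returns null when the markers are missing, so the caller can fall back to post-processing.

```typescript
type DualOutput = { spoken: string; detailed: string };

/** Split a model response on the SPOKEN:/DETAILED: markers, or return null. */
function parseDualOutput(response: string): DualOutput | null {
  const match = response.match(/SPOKEN:\s*([\s\S]*?)\s*DETAILED:\s*([\s\S]*)/);
  if (match === null) return null;
  const spoken = (match[1] ?? "").trim();
  const detailed = (match[2] ?? "").trim();
  // Treat an empty half as a format failure rather than a valid reply.
  if (spoken.length === 0 || detailed.length === 0) return null;
  return { spoken, detailed };
}
```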
Approach B: Post-processing strip (fallback)
If the model doesn't follow the dual output format, post-process the full response:
- Strip **bold** → "bold"
- Strip `code` → "code"
- Strip # headers → remove # prefix
- Strip - bullet points → convert to sentences or strip
- Strip code blocks → summarize as "[code example]" or remove entirely
Use the stripped text as the spoken variant. The full original markdown response is the detailed variant.
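The strip rules above map onto a short chain of regex replacements. This is a rough sketch, not a full markdown parser: `stripMarkdownForSpeech` is a hypothetical name, bullets are stripped rather than rewritten as sentences, and nested or unusual markdown (tables, links, italics) is left untouched.

```typescript
/** Remove the markdown symbols listed above so TTS never reads them aloud. */
function stripMarkdownForSpeech(markdown: string): string {
  return markdown
    .replace(/`{3}[\s\S]*?`{3}/g, "[code example]") // fenced code blocks
    .replace(/`([^`]+)`/g, "$1") // inline code
    .replace(/\*\*([^*]+)\*\*/g, "$1") // bold
    .replace(/^#{1,6}[ \t]+/gm, "") // heading markers
    .replace(/^[ \t]*[-*][ \t]+/gm, "") // bullet markers
    .replace(/\n{2,}/g, "\n") // collapse blank lines
    .trim();
}
```

Fenced blocks are handled first so that the inline-code rule does not consume their backticks.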
Reliable rule: Never read markdown symbols aloud. Either approach prevents this; dual output is preferred because it lets the LLM choose better phrasing for spoken delivery (short, natural sentences vs. information-dense bullets).
Sources
- Real-Time vs Turn-Based STT/TTS Voice Agent Architecture (softcery.com)
- The Voice AI Stack for Building Agents (assemblyai.com)
- One-Second Voice-to-Voice Latency with Modal, Pipecat, and Open Models (modal.com)
- Voice Chat with Local LLMs: Whisper + TTS (insiderllm.com)
- whisper-cpp VAD (ggml-org/whisper.cpp on GitHub)
- Telegram Bot API — sendVoice (core.telegram.org)
- Convert Voice Memos from Telegram to Text using OpenAI Whisper (dev.to)
- Telegram speech-to-text bot with Node.js (loonskai.com)
- Telegraf: Modern Telegram Bot Framework for Node.js (telegraf.js.org)
- HA Voice PE markdown post-processing discussion (community.home-assistant.io)
- Two design patterns for Telegram Bots (dev.to/madhead)
- Design voice AI experiences — LiveKit Agents UI (livekit.com)
- User Interaction Patterns in LLM-Powered Voice Assistants (arxiv.org)
- voicegram PyPI — OGG/Opus conversion (pypi.org)
Feature research for: Nexus v1.6 Voice Pipeline + Minimal Telegram Bridge Researched: 2026-04-03