# Feature Research
**Domain:** Voice Pipeline (Whisper STT + Piper TTS) + Telegram Bridge (Nexus v1.6)
**Researched:** 2026-04-03
**Confidence:** MEDIUM-HIGH — STT/TTS pipeline patterns are well-documented; Telegram bot API is stable; dual-output formatting and voice mode UX patterns inferred from ChatGPT/Meta AI voice implementations and community patterns
---
## Milestone Scope
This document covers only the NEW features in v1.6. The following are already built and are dependencies, not deliverables:
- VoiceRecordButton with MediaRecorder API in ChatInput (v1.3)
- TtsButton with @mintplex-labs/piper-tts-web WASM synthesis (v1.3/v1.5)
- POST /transcribe endpoint with whisper-cpp/openai-whisper cascade (v1.3)
- VoiceStep in onboarding wizard (v1.5)
- voiceEnabled in nexus-settings (v1.5)
- Full chat system with streaming SSE (v1.3)
**New features being researched:**
- Transport-agnostic voice pipeline (server-side, not just browser WASM)
- Voice mode flag on messages (affects response formatting)
- Dual output pattern: voice-optimized prose + full markdown text
- Web chat voice UI improvements: silence detection, waveform, auto-submit
- Web chat audio playback: inline player, auto-play toggle
- Voice mode toggle setting (text only / voice input / full voice)
- Minimal Telegram bridge: single bot, text + voice relay, agent prefixing
---
## Feature Landscape
### Table Stakes (Users Expect These)
Features users assume exist when voice or Telegram is mentioned. Missing these makes the feature feel broken or incomplete.
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Silence-based auto-submit | Every voice input UI (Siri, Google, Whisper demos) stops recording on silence; holding a button feels archaic | MEDIUM | WebRTC VAD or AudioWorklet amplitude monitoring; 1.5s silence threshold typical; must show countdown so user knows what's happening |
| Waveform/amplitude visualization while recording | Users expect visual feedback that the mic is active; a static "recording..." text feels broken | LOW | Canvas or SVG with 30-50 data points; AnalyserNode from Web Audio API; real-time amplitude bars, not pre-rendered waveform |
| Voice response auto-play toggle | If the AI responded with audio, playing it automatically is expected unless the user disabled it; manual play-only feels incomplete | LOW | Boolean setting in nexus-settings (voiceAutoPlay); inline HTML5 `<audio>` element is sufficient; Web Audio API not needed |
| Markdown-free voice responses | Users who hear responses read aloud expect prose sentences, not "asterisk asterisk bold asterisk asterisk code block triple backtick" spoken aloud | MEDIUM | Requires voice mode flag on the message sent to LLM; system prompt addendum: "respond in natural spoken prose, no markdown symbols, no bullet points, no code blocks unless the user explicitly asks"; dual output requires separate LLM pass or post-processing strip |
| Telegram text relay to existing chat | Sending a text message to the Telegram bot and receiving the agent's reply is the core use case; anything less is not a bridge | MEDIUM | Telegraf (Node.js) as bot framework; message forwarded to existing chat API endpoint; response prefixed with agent name |
| Telegram voice message transcription | Telegram users frequently send voice notes; a bridge that ignores voice messages frustrates mobile users immediately | MEDIUM | Telegram sends voice as OGG/Opus; download → convert (ffmpeg) → POST /transcribe → forward text to agent → reply with text (+ optionally TTS audio back) |
| Agent identity visible in Telegram replies | When multiple agents can respond, the user must know who is replying | LOW | Simple text prefix: `[Hermes] Your answer here`; consistent format across all messages |
| Recording state visible in UI | Users must be able to tell when recording is active vs. idle vs. processing | LOW | Three states in mic button: idle (mic icon), recording (red pulsing), processing (spinner); state machine pattern |
### Differentiators (Competitive Advantage)
Features that make v1.6's voice and Telegram features worth using, beyond baseline functionality.
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| Transport-agnostic voice pipeline | Voice processing works identically for browser input, Telegram voice notes, and future CLI/API callers; no duplication of Whisper/Piper logic | MEDIUM | Abstract to a `VoicePipelineService`: `transcribe(audioBuffer) → text`, `synthesize(text, voice?) → audioBuffer`; HTTP endpoints call the service; Telegram bot calls the same service |
| Dual output pattern | AI responds with two representations: short spoken-prose version (for TTS/Telegram) and full markdown version (for web chat, copy-paste, code); user sees both where appropriate | HIGH | Prompt engineering: "Provide a SPOKEN response (1-3 sentences, no markdown) and a DETAILED response (full markdown). Format: SPOKEN: … DETAILED: …"; parse and split in middleware; store both in message metadata |
| Sentence-buffered TTS streaming | Start playing the first sentence while the second is still synthesizing; reduces perceived latency vs. waiting for full response | MEDIUM | Split response on `.!?`; Piper synthesizes sentence 1, audio starts playing; meanwhile sentence 2 begins synthesis; append chunks to audio queue |
| Voice mode flag preserves context | Messages tagged with `voice_mode: true` in the DB let the UI, Telegram bridge, and future Command Center all render correctly without re-inferring intent | LOW | Add `source` field or `voice_mode` boolean to message metadata; already-existing message schema likely supports metadata/extras column |
| Telegram as thin relay (not a separate chat product) | The Telegram bot forwards to the existing Nexus chat engine; responses use the full agent intelligence already configured; no separate bot personality to maintain | LOW | Relay pattern: Telegram message → POST /api/workspaces/:id/chat/messages → SSE stream → collect full response → reply to Telegram; agent prefixing is presentation only |
| Language auto-detection in STT | Whisper natively detects language without configuration; relay this info back to the UI so the user knows what language was detected | LOW | Whisper returns `language` in its JSON output; pass through to transcript response; log in message metadata; no user config needed for common languages |
### Anti-Features (Commonly Requested, Often Problematic)
| Feature | Why Requested | Why Problematic | Alternative |
|---------|---------------|-----------------|-------------|
| Real-time speech-to-speech streaming | Feels like a "next level" voice experience | Requires full-duplex WebSocket audio, interrupt handling, turn-taking logic, VAD on both ends — an entirely different architecture (Pipecat, LiveKit); out of scope for a relay bridge | Sequential pipeline (speak → wait → hear) is sufficient for assistant use cases; real-time is only needed for phone-call-style interaction |
| Per-agent Telegram bots | "My PM agent should have its own bot handle" | Multiple bots means multiple bot tokens, multiple webhook registrations, complex routing when agents hand off to each other; maintenance nightmare | Single bot with agent name prefix in messages: `[PM] Here is your sprint plan`; PROJECT.md explicitly out-of-scopes this |
| Deep Telegram ↔ web chat sync | "I want to see Telegram messages in the web UI" | Real-time bidirectional sync requires a shared event bus (Postgres LISTEN/NOTIFY or Redis pub/sub), session management across transports, and conflict resolution; PROJECT.md explicitly defers this to "Postgres bus" future milestone | Relay is one-way per session: Telegram message → agent → Telegram reply; web chat is a separate session |
| Wake word detection | "Hey Nexus, start recording" | Requires always-on microphone access, local wakeword model (Porcupine, OpenWakeWord), and careful battery/privacy handling; browser does not allow always-on mic | Mic button tap is sufficient; wake word is a future hardware device concern |
| Streaming TTS word-by-word | Feels maximally responsive | Browser audio playback of a stream of tiny WAV fragments causes clicks, gaps, and buffering issues; each Piper call has startup overhead; the sentence-buffered approach gives 95% of the benefit | Sentence-buffered playback (buffer on `.!?`); start playing sentence 1 while sentence 2 synthesizes |
| Inline code execution over Telegram | "I want to run tasks from Telegram" | Security: arbitrary code execution via an unauthenticated chat interface; scope: Telegram bridge is explicitly a thin relay, not a command interface | Support text and voice message relay only; task creation via conversational agent response is sufficient |
| GSD formatting / rich elements in Telegram | Telegram supports inline keyboards, threaded replies — use them | Telegram's formatting model (inline keyboards, callback queries) requires stateful session tracking; PROJECT.md explicitly out-of-scopes this | Plain text + Markdown v1 (which Telegram natively renders for bold/italic/code); no inline keyboards in v1.6 |
| Transcription editing before sending | "Let me see the transcript before it goes to the agent" | Adds a confirmation step that breaks the hands-free voice flow; most users trust auto-send after VAD silence detection; optionally show transcript as a message in the UI after the fact | Show the detected transcript in the chat message bubble with a small "mic" icon; no edit step |
---
## Feature Dependencies
```
Transport-Agnostic VoicePipelineService
└──wraps──> Existing /transcribe endpoint (Whisper) [already built]
└──wraps──> Piper TTS binary/WASM [already built in browser; server-side is new]
└──consumed-by──> Web chat mic button (browser calls server or uses WASM directly)
└──consumed-by──> Telegram bridge (server-side calls VoicePipelineService)
└──consumed-by──> Future transports (CLI, API, Command Center)
Voice Mode Flag
└──set-by──> Web chat (user is in voice mode)
└──set-by──> Telegram bridge (message arrived as voice note)
└──consumed-by──> LLM prompt construction (appends no-markdown instruction)
└──consumed-by──> Dual output pattern (triggers two-response format)
└──consumed-by──> TTS synthesis (triggers auto-synthesis of response)
Dual Output Pattern
└──requires──> Voice mode flag (only triggers in voice mode)
└──requires──> LLM prompt engineering (structured SPOKEN/DETAILED format)
└──produces──> Short prose (for TTS, Telegram reply)
└──produces──> Full markdown (for web chat display, copy)
Web Chat Voice UI (silence detection + waveform)
└──requires──> Existing VoiceRecordButton [already built — enhance, not replace]
└──requires──> Web Audio API (AnalyserNode for amplitude) [browser built-in]
└──enhances──> Voice Mode Toggle (waveform only visible when voice mode active)
Web Chat Audio Playback
└──requires──> TTS synthesis output (WAV/MP3 audio buffer)
└──requires──> Voice mode flag (auto-play only in full voice mode)
└──independent──> waveform visualization (different UI component)
Telegram Bridge
└──requires──> VoicePipelineService (for voice note handling)
└──requires──> Existing chat API (POST /api/... for message relay)
└──requires──> ffmpeg (OGG/Opus → WAV conversion for Whisper)
└──requires──> Telegraf (Node.js bot framework)
└──independent──> web chat UI changes
Onboarding STT/TTS Detection
└──requires──> Existing VoiceStep [already built — update, not replace]
└──requires──> VoicePipelineService availability check
└──independent──> Telegram bridge
```
### Dependency Notes
- **VoicePipelineService is the keystone:** Build this first. It abstracts Whisper + Piper behind a clean interface. Every other v1.6 feature is a consumer. If this is skipped, the Telegram bridge and web improvements become duplicate, divergent code.
- **Voice mode flag must be stored on the message:** Not just passed in memory. Future Command Center and Telegram both need to know retroactively whether a message was voice-originated.
- **Dual output is optional on non-voice messages:** Text-mode messages do not need the SPOKEN variant. The prompt injection and response parsing only apply when `voice_mode: true`.
- **Telegram bridge has no UI:** It's a server-side Node.js process (or Express route). No React changes needed for Telegram.
- **ffmpeg is a hard dependency for Telegram voice notes:** Telegram delivers voice notes as OGG/Opus, while the voice note flow feeds Whisper 16 kHz mono WAV. ffmpeg must be available on the server to do that conversion. On the Mac Mini, install it with `brew install ffmpeg`.
- **Web chat waveform enhances existing VoiceRecordButton:** Do not replace it. The existing component handles MediaRecorder and send; add AudioWorklet/AnalyserNode visualization on top.
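
As a rough illustration of the keystone idea, the sketch below shows one possible shape for the service interface and how two transports could depend on it. The names `TranscriptResult`, `FakeVoicePipeline`, `handleWebUpload`, and `handleTelegramVoice` are hypothetical, and the fake implementation stands in for the real Whisper and Piper calls.

```typescript
// Hypothetical result shape; field names are assumptions, not the project's schema.
interface TranscriptResult {
  text: string;
  language?: string; // e.g. "en", when the STT backend reports it
}

// The two operations named in the feature table: transcribe and synthesize.
interface VoicePipelineService {
  transcribe(audio: Uint8Array): Promise<TranscriptResult>;
  synthesize(text: string, voice?: string): Promise<Uint8Array>;
}

// Stand-in implementation for wiring and tests; a real one would call Whisper and Piper.
class FakeVoicePipeline implements VoicePipelineService {
  async transcribe(audio: Uint8Array): Promise<TranscriptResult> {
    return { text: `(${audio.length} bytes of audio)`, language: "en" };
  }

  async synthesize(text: string, voice: string = "default"): Promise<Uint8Array> {
    // Encode "voice:text" as bytes so callers can inspect the result in tests.
    const tagged = `${voice}:${text}`;
    const out = new Uint8Array(tagged.length);
    for (let i = 0; i < tagged.length; i++) out[i] = tagged.charCodeAt(i) & 0xff;
    return out;
  }
}

// Both transports depend only on the interface, never on Whisper or Piper directly.
async function handleWebUpload(svc: VoicePipelineService, audio: Uint8Array): Promise<string> {
  return (await svc.transcribe(audio)).text;
}

async function handleTelegramVoice(svc: VoicePipelineService, wav: Uint8Array): Promise<string> {
  return (await svc.transcribe(wav)).text;
}
```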
---
## MVP Definition
### Launch With (v1.6 Milestone)
Minimum viable set to make voice and Telegram genuinely useful, not just technically present.
- [ ] **VoicePipelineService** — Transport-agnostic server-side Whisper + Piper abstraction. Why essential: gates all other features; prevents code duplication between web and Telegram.
- [ ] **Voice mode flag + dual output** — LLM receives no-markdown instruction; response splits into spoken prose + full markdown. Why essential: spoken markdown sounds broken; this is what makes TTS usable.
- [ ] **Web chat silence detection + auto-submit** — Amplitude-based VAD stops recording automatically and submits. Why essential: hands-free voice only works if the user does not have to click "send."
- [ ] **Web chat waveform visualization** — Amplitude bars while recording. Why essential: without it, users cannot tell if the mic is picking up audio.
- [ ] **Web chat audio playback with auto-play toggle** — Agent voice responses play inline. Why essential: without playback, TTS synthesis has nowhere to go.
- [ ] **Voice mode toggle setting** — Three modes: text only / voice input only / full voice (input + output). Why essential: users need to control the modality per session.
- [ ] **Telegram text relay** — Text messages in → agent response out, with agent prefix. Why essential: core use case for phone access.
- [ ] **Telegram voice note relay** — Voice notes in → transcribe → agent → text reply. Why essential: mobile Telegram users default to voice notes.
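
The silence detection item can be sketched as a small state machine over amplitude frames. This is a minimal sketch, assuming frames of samples in the range [-1, 1] (as an `AnalyserNode` or AudioWorklet could supply) and the 1.5 s threshold mentioned in the feature table. The class name `SilenceDetector`, the default RMS threshold of 0.02, and the rule that silence only counts after speech has been heard are illustrative assumptions.

```typescript
interface SilenceDetectorOptions {
  threshold: number; // RMS level below which a frame counts as silent (assumed 0..1 scale)
  silenceMs: number; // how long silence must last before auto-submit
}

// Root-mean-square amplitude of one frame of samples in the range [-1, 1].
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

class SilenceDetector {
  private opts: SilenceDetectorOptions;
  private heardSpeech = false;
  private silentSince: number | null = null;

  constructor(opts: SilenceDetectorOptions = { threshold: 0.02, silenceMs: 1500 }) {
    this.opts = opts;
  }

  // Feed one frame with its timestamp; returns true when recording should stop.
  push(frame: Float32Array, nowMs: number): boolean {
    if (rms(frame) >= this.opts.threshold) {
      this.heardSpeech = true;
      this.silentSince = null;
      return false;
    }
    if (!this.heardSpeech) return false; // do not submit before the user has spoken
    if (this.silentSince === null) this.silentSince = nowMs;
    return nowMs - this.silentSince >= this.opts.silenceMs;
  }

  // Milliseconds left before auto-submit, for a countdown UI; null when not counting down.
  remainingMs(nowMs: number): number | null {
    if (this.silentSince === null) return null;
    return Math.max(0, this.opts.silenceMs - (nowMs - this.silentSince));
  }
}
```

Passing timestamps in explicitly keeps the detector free of browser APIs, so the same logic can be unit-tested outside the recording component.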
### Add After Validation (v1.6.x)
- [ ] **Telegram TTS reply option** — Agent response synthesized and sent back as an OGG voice note. Trigger: user feedback that text replies are too long to read on phone.
- [ ] **Sentence-buffered TTS streaming** — Start audio playback before full synthesis completes. Trigger: latency complaints with longer responses.
- [ ] **Voice response history in UI** — Chat messages show audio player for past synthesized responses (not just the current one). Trigger: users want to replay previous responses.
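
For the sentence-buffered streaming item, one way to picture the buffering step is a small accumulator that receives streamed text chunks and releases complete sentences for synthesis. This is only a sketch: `SentenceBuffer` is a hypothetical name, and treating a terminator as a boundary only when followed by a space or newline (so that "1.5" is not split) is an assumption beyond the plain `.!?` split described earlier.

```typescript
// Accumulates streamed text and releases complete sentences as they become available.
class SentenceBuffer {
  private pending = "";

  // Add a chunk; returns any sentences that are now complete.
  push(chunk: string): string[] {
    this.pending += chunk;
    const ready: string[] = [];
    let start = 0;
    // Stop one character early: a terminator at the very end might still be
    // followed by more text (e.g. "1." then "5") in the next chunk.
    for (let i = 0; i + 1 < this.pending.length; i++) {
      const ch = this.pending.charAt(i);
      const next = this.pending.charAt(i + 1);
      const isTerminator = ch === "." || ch === "!" || ch === "?";
      if (isTerminator && (next === " " || next === "\n")) {
        const sentence = this.pending.slice(start, i + 1).trim();
        if (sentence) ready.push(sentence);
        start = i + 1;
      }
    }
    this.pending = this.pending.slice(start);
    return ready;
  }

  // Call once the stream ends to release whatever text is left.
  flush(): string[] {
    const tail = this.pending.trim();
    this.pending = "";
    return tail ? [tail] : [];
  }
}
```

Each sentence returned by `push` could be handed to synthesis immediately while later chunks are still arriving.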
### Future Consideration (v2+)
- [ ] **Real-time speech-to-speech** — Full-duplex conversation; requires Pipecat or LiveKit; entirely different architecture.
- [ ] **Wake word detection** — Always-on mic, local wakeword model; hardware device concern.
- [ ] **Deep Telegram ↔ web sync** — Bidirectional session mirroring via Postgres bus; deferred per PROJECT.md.
- [ ] **Per-transport voice models** — Different Piper voice for Telegram vs. web (e.g., cleaner phone voice vs. natural assistant voice).
---
## Feature Prioritization Matrix
| Feature | User Value | Implementation Cost | Priority |
|---------|------------|---------------------|----------|
| VoicePipelineService | HIGH | MEDIUM | P1 |
| Voice mode flag + dual output | HIGH | MEDIUM | P1 |
| Silence detection + auto-submit | HIGH | MEDIUM | P1 |
| Waveform visualization | MEDIUM | LOW | P1 |
| Audio playback + auto-play toggle | HIGH | LOW | P1 |
| Voice mode toggle setting | HIGH | LOW | P1 |
| Telegram text relay | HIGH | MEDIUM | P1 |
| Telegram voice note relay | HIGH | MEDIUM | P1 |
| Telegram TTS reply | MEDIUM | MEDIUM | P2 |
| Sentence-buffered TTS streaming | MEDIUM | MEDIUM | P2 |
| Voice response history | LOW | MEDIUM | P3 |
| Real-time speech-to-speech | HIGH | HIGH | P3 (v2+) |
**Priority key:**
- P1: Must have for v1.6 launch
- P2: Should have, add in v1.6.x
- P3: Nice to have, v2+
---
## Competitor Feature Analysis
| Feature | ChatGPT Voice Mode | Telegram + other bots | Nexus v1.6 Approach |
|---------|--------------------|-----------------------|---------------------|
| STT | Whisper (cloud) | Per-bot (usually cloud) | Whisper local, CPU fallback |
| TTS | Custom neural (cloud) | gTTS or ElevenLabs | Piper local, CPU-only |
| Markdown-free voice | Yes (GPT strips markdown) | Usually not (bots send raw markdown) | Dual output: SPOKEN + DETAILED |
| Silence detection | Yes (VAD, full-duplex) | N/A | Amplitude VAD, 1.5s threshold |
| Waveform UI | Animated blobs (not literal waveform) | N/A | AnalyserNode amplitude bars |
| Agent identity in replies | N/A (single assistant) | Custom per bot | Text prefix `[AgentName]` |
| Telegram voice note support | N/A | Varies widely | OGG→WAV→Whisper→agent |
| Offline / local operation | No | No | Fully local: Whisper + Piper + Ollama |
| Transport abstraction | N/A | N/A | VoicePipelineService (web + Telegram share same service) |
---
## Voice Pipeline Architecture Notes
**Confidence:** HIGH for the cascading/sequential pipeline; MEDIUM for dual output prompt engineering reliability.
### Sequential Pipeline (chosen architecture for v1.6)
```
[Browser/Telegram]
|
| audio buffer (WAV/OGG)
v
VoicePipelineService.transcribe()
|
| transcript text + language + confidence
v
LLM (with voice_mode prompt addendum)
|
| structured response: SPOKEN: "..." DETAILED: "..."
v
Response parser → { spoken: string, detailed: string }
| |
| v
| Web chat: render detailed (markdown)
| Telegram: send spoken as text
v
VoicePipelineService.synthesize(spoken)
|
| WAV audio buffer
v
Web chat: <audio> element autoplay
Telegram (v1.6.x): sendVoice() as OGG/Opus
```
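
The flow above can be sketched as a single function with its stages injected, which keeps the orchestration testable without Whisper, Piper, or an LLM present. Everything named here (`VoiceTurnDeps`, `VoiceTurnResult`, `runVoiceTurn`) is hypothetical and only mirrors the boxes in the diagram.

```typescript
// Each stage of the diagram as an injected dependency (all names are illustrative).
interface VoiceTurnDeps {
  transcribe(audio: Uint8Array): Promise<string>;
  complete(transcript: string, voiceMode: boolean): Promise<string>;
  parse(raw: string): { spoken: string; detailed: string };
  synthesize(spoken: string): Promise<Uint8Array>;
}

interface VoiceTurnResult {
  transcript: string;
  spoken: string;   // sent to TTS and to Telegram as text
  detailed: string; // rendered as markdown in web chat
  audio: Uint8Array;
}

// Runs one sequential voice turn: STT, then LLM, then parse, then TTS.
async function runVoiceTurn(deps: VoiceTurnDeps, audio: Uint8Array): Promise<VoiceTurnResult> {
  const transcript = await deps.transcribe(audio);
  const raw = await deps.complete(transcript, true); // voice_mode prompt addendum applies
  const { spoken, detailed } = deps.parse(raw);
  const audioOut = await deps.synthesize(spoken);    // only the spoken variant is synthesized
  return { transcript, spoken, detailed, audio: audioOut };
}
```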
### Why not real-time speech-to-speech:
Real-time requires full-duplex WebSocket audio, interrupt detection (barge-in), turn-taking state machine, and sub-200ms latency budgets. The sequential pattern targets <3s end-to-end on Apple Silicon M4, which is appropriate for assistant interactions (not phone calls). The complexity delta is enormous; PROJECT.md explicitly defers this.
---
## Telegram Bridge Architecture Notes
**Confidence:** HIGH. Telegraf is a widely used Node.js Telegram bot framework; the relay patterns are well-established.
### Single Bot, Agent Prefix Pattern
```
Telegram user sends: "What's the status of the Nexus project?"
|
Telegraf handler
|
POST /api/workspaces/:id/chat/messages
{ content: "What's the status...", source: "telegram", voice_mode: false }
|
SSE stream → collect until [DONE]
|
ctx.reply("[Hermes] The Nexus project is currently...")
```
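
The "collect until [DONE]" step can be sketched as a pure function over the raw SSE text. This is a minimal sketch under stated assumptions: each `data:` line is assumed to carry a JSON object with a `content` field, which is a hypothetical payload shape and should be replaced with whatever the existing chat endpoint actually emits. The `[DONE]` sentinel is taken from the diagram above.

```typescript
// Concatenates the text carried by SSE "data:" lines until the [DONE] sentinel.
// ASSUMPTION: each payload is JSON like {"content":"..."}; adjust to the real stream format.
function collectSseText(stream: string): string {
  let text = "";
  for (const line of stream.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue; // skip blank lines, comments, event:/id: fields
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") break;
    const chunk = JSON.parse(payload) as { content?: string };
    text += chunk.content ?? "";
  }
  return text;
}

// Presentation-only agent prefix, as in "[Hermes] Your answer here".
function withAgentPrefix(agentName: string, text: string): string {
  return `[${agentName}] ${text}`;
}
```

Usage, given the raw body of a finished stream: `withAgentPrefix("Hermes", collectSseText(body))` yields the string to send back to the Telegram chat.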
### Voice Note Flow
```
Telegram user sends voice note (OGG/Opus, ~15s)
|
Telegraf voice handler: ctx.telegram.getFileLink(fileId) → download OGG
|
ffmpeg: OGG → WAV (16kHz mono)
|
VoicePipelineService.transcribe(wavBuffer)
|
POST /api/workspaces/:id/chat/messages
{ content: transcript, source: "telegram", voice_mode: true }
|
Collect SSE stream → spoken variant of response
|
ctx.reply("[Hermes] " + spokenResponse)
// v1.6.x: ctx.replyWithVoice({ source: synthesizedOggBuffer })
```
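
The ffmpeg step in this flow comes down to a fixed set of arguments. The sketch below builds them as an array so they can be passed to a process spawner such as Node's `child_process.execFile("ffmpeg", args)` without shell quoting; the function name `oggToWavArgs` and the file paths are illustrative.

```typescript
// Builds ffmpeg arguments for converting a Telegram voice note (OGG/Opus)
// into 16 kHz mono 16-bit PCM WAV.
function oggToWavArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y",                // overwrite the output file if it already exists
    "-i", inputPath,     // input: the downloaded OGG/Opus voice note
    "-ar", "16000",      // resample to 16 kHz
    "-ac", "1",          // downmix to mono
    "-c:a", "pcm_s16le", // encode as 16-bit little-endian PCM
    outputPath,
  ];
}

// Example: the equivalent command line for a hypothetical temp file pair.
console.log(["ffmpeg", ...oggToWavArgs("/tmp/note.ogg", "/tmp/note.wav")].join(" "));
```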
### Key implementation decisions:
- **Polling vs. webhooks:** Webhooks require a public HTTPS endpoint. For Mac Mini on home network, long polling is the correct choice. Telegraf supports both; use `bot.launch()` (polling mode) for v1.6.
- **Bot token storage:** Environment variable `TELEGRAM_BOT_TOKEN`; added to `.env` and loaded via existing env config pattern.
- **Authorized users only:** Store allowed Telegram user IDs or usernames in nexus-settings to prevent unauthorized access; a bridge with no auth is a security hole.
- **Conversation context:** Each Telegram chat ID maps to a Nexus workspace session; maintain a `telegramChatId → workspaceId + conversationId` mapping in a lightweight in-memory store or SQLite table.
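
The last two decisions (allowlisting and chat-to-conversation mapping) can be sketched together as a small in-memory store. This is an assumption-laden sketch: `TelegramSessionStore` and `ChatBinding` are hypothetical names, numeric user IDs are used for the allowlist, and persistence to nexus-settings or SQLite is left out.

```typescript
// What a Telegram chat is bound to on the Nexus side (field names are illustrative).
interface ChatBinding {
  workspaceId: string;
  conversationId: string;
}

class TelegramSessionStore {
  private allowed: Set<number>;
  private bindings = new Map<number, ChatBinding>();

  constructor(allowedUserIds: number[]) {
    this.allowed = new Set(allowedUserIds);
  }

  // Reject updates with no sender or from users outside the allowlist.
  isAuthorized(userId: number | undefined): boolean {
    return userId !== undefined && this.allowed.has(userId);
  }

  // Remember which workspace and conversation a Telegram chat relays into.
  bind(chatId: number, binding: ChatBinding): void {
    this.bindings.set(chatId, binding);
  }

  lookup(chatId: number): ChatBinding | undefined {
    return this.bindings.get(chatId);
  }
}
```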
---
## Voice Mode Response Formatting Notes
**Confidence:** MEDIUM. The dual output prompt pattern is used in production systems, but prompt reliability varies by model; the post-processing strip is more reliable.
### Two approaches; use A first, with B as the fallback:
**Approach A: Prompt-based dual output (preferred)**
Append to system prompt when `voice_mode: true`:
```
When responding, provide two versions:
SPOKEN: [1-3 sentences in natural spoken prose, no markdown, no symbols, no lists]
DETAILED: [Full response with markdown formatting, code blocks, bullet points as needed]
```
Parse response: split on `SPOKEN:` and `DETAILED:` markers.
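
A minimal sketch of that parsing step follows, assuming the model emits the two markers in the order shown and exactly once. `parseDualOutput` and `DualOutput` are hypothetical names; returning `null` signals that the format was not followed so the caller can switch to the fallback.

```typescript
interface DualOutput {
  spoken: string;
  detailed: string;
}

// Splits a response of the form "SPOKEN: ... DETAILED: ..." into its two parts.
// Returns null when either marker is missing or either part is empty.
function parseDualOutput(raw: string): DualOutput | null {
  const match = raw.match(/SPOKEN:\s*([\s\S]*?)\s*DETAILED:\s*([\s\S]*)/);
  if (!match) return null;
  const spoken = match[1].trim();
  const detailed = match[2].trim();
  if (!spoken || !detailed) return null;
  return { spoken, detailed };
}
```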
**Approach B: Post-processing strip (fallback)**
If the model doesn't follow the dual output format, post-process the full response:
- Strip `**bold**` → "bold"
- Strip `` `code` `` → "code"
- Strip `# headers` → remove `#` prefix
- Strip `- ` bullet points → convert to sentences or strip
- Strip ``` code blocks ``` → summarize as "[code example]" or remove entirely
Use as the spoken variant. The full original markdown response is the detailed variant.
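
The fallback rules above can be sketched with a handful of regular expressions. This is a rough sketch, not a full markdown parser: `stripMarkdownForSpeech` is a hypothetical name, bullets are simply stripped rather than rewritten as sentences, and constructs not listed above (links, tables, nested emphasis) are left untouched.

```typescript
// Removes the markdown constructs listed above so the text can be read aloud.
function stripMarkdownForSpeech(markdown: string): string {
  return markdown
    .replace(/`{3}[\s\S]*?`{3}/g, "[code example]") // fenced code blocks become a placeholder
    .replace(/`([^`]*)`/g, "$1")                    // inline code keeps its text
    .replace(/\*\*([^*]+)\*\*/g, "$1")              // bold keeps its text
    .replace(/^#{1,6}[ \t]+/gm, "")                 // header prefixes are dropped
    .replace(/^[ \t]*[-*][ \t]+/gm, "")             // bullet markers are dropped
    .replace(/\n{2,}/g, "\n")                       // collapse blank lines
    .trim();
}
```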
**Reliable rule:** Never read markdown symbols aloud. Either approach prevents this; dual output is preferred because it lets the LLM choose better phrasing for spoken delivery (short, natural sentences vs. information-dense bullets).
---
## Sources
- [Real-Time vs Turn-Based STT/TTS Voice Agent Architecture (softcery.com)](https://softcery.com/lab/ai-voice-agents-real-time-vs-turn-based-tts-stt-architecture)
- [The Voice AI Stack for Building Agents (assemblyai.com)](https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents)
- [One-Second Voice-to-Voice Latency with Modal, Pipecat, and Open Models (modal.com)](https://modal.com/blog/low-latency-voice-bot)
- [Voice Chat with Local LLMs: Whisper + TTS (insiderllm.com)](https://www.insiderllm.com/guides/voice-chat-local-llms-whisper-tts/)
- [whisper-cpp VAD (ggml-org/whisper.cpp on GitHub)](https://github.com/ggml-org/whisper.cpp)
- [Telegram Bot API — sendVoice (core.telegram.org)](https://core.telegram.org/bots/api)
- [Convert Voice Memos from Telegram to Text using OpenAI Whisper (dev.to)](https://dev.to/techresolve/solved-convert-voice-memos-from-telegram-to-text-using-openai-whisper-api-41al)
- [Telegram speech-to-text bot with Node.js (loonskai.com)](https://www.loonskai.com/blog/telegram-speech-to-text-bot-with-nodejs)
- [Telegraf: Modern Telegram Bot Framework for Node.js (telegraf.js.org)](https://telegraf.js.org/)
- [HA Voice PE markdown post-processing discussion (community.home-assistant.io)](https://community.home-assistant.io/t/ha-voice-pe-add-post-processing-step-between-conversation-agent-and-speech-to-text-step/893933)
- [Two design patterns for Telegram Bots (dev.to/madhead)](https://dev.to/madhead/two-design-patterns-for-telegram-bots-59f5)
- [Design voice AI experiences — LiveKit Agents UI (livekit.com)](https://livekit.com/blog/design-voice-ai-interfaces-with-agents-ui)
- [User Interaction Patterns in LLM-Powered Voice Assistants (arxiv.org)](https://arxiv.org/html/2309.13879v2)
- [voicegram PyPI — OGG/Opus conversion (pypi.org)](https://pypi.org/project/voicegram/)
---
*Feature research for: Nexus v1.6 Voice Pipeline + Minimal Telegram Bridge*
*Researched: 2026-04-03*