Nexus Dev f6da67ecf4 docs: complete project research

2026-04-04 03:55:49 +00:00


Feature Research

Domain: Voice Pipeline (Whisper STT + Piper TTS) + Telegram Bridge (Nexus v1.6)
Researched: 2026-04-03
Confidence: MEDIUM-HIGH. STT/TTS pipeline patterns are well-documented and the Telegram Bot API is stable; dual-output formatting and voice mode UX patterns are inferred from ChatGPT/Meta AI voice implementations and community patterns.

Milestone Scope

This document covers only the NEW features in v1.6. The following are already built and are dependencies, not deliverables:

VoiceRecordButton with MediaRecorder API in ChatInput (v1.3)
TtsButton with @mintplex-labs/piper-tts-web WASM synthesis (v1.3/v1.5)
POST /transcribe endpoint with whisper-cpp/openai-whisper cascade (v1.3)
VoiceStep in onboarding wizard (v1.5)
voiceEnabled in nexus-settings (v1.5)
Full chat system with streaming SSE (v1.3)

New features being researched:

Transport-agnostic voice pipeline (server-side, not just browser WASM)
Voice mode flag on messages (affects response formatting)
Dual output pattern: voice-optimized prose + full markdown text
Web chat voice UI improvements: silence detection, waveform, auto-submit
Web chat audio playback: inline player, auto-play toggle
Voice mode toggle setting (text only / voice input / full voice)
Minimal Telegram bridge: single bot, text + voice relay, agent prefixing

Feature Landscape

Table Stakes (Users Expect These)

Features users assume exist when voice or Telegram is mentioned. Missing these makes the feature feel broken or incomplete.

Feature	Why Expected	Complexity	Notes
Silence-based auto-submit	Every voice input UI (Siri, Google, Whisper demos) stops recording on silence; holding a button feels archaic	MEDIUM	WebRTC VAD or AudioWorklet amplitude monitoring; 1.5s silence threshold typical; must show countdown so user knows what's happening
Waveform/amplitude visualization while recording	Users expect visual feedback that the mic is active; a static "recording..." text feels broken	LOW	Canvas or SVG with 30-50 data points; AnalyserNode from Web Audio API; real-time amplitude bars, not pre-rendered waveform
Voice response auto-play toggle	If the AI responded with audio, playing it automatically is expected unless the user disabled it; manual play-only feels incomplete	LOW	Boolean setting in nexus-settings (voiceAutoPlay); inline HTML5 `<audio>` element is sufficient; Web Audio API not needed
Markdown-free voice responses	Users who hear responses read aloud expect prose sentences, not "asterisk asterisk bold asterisk asterisk code block triple backtick" spoken aloud	MEDIUM	Requires voice mode flag on the message sent to LLM; system prompt addendum: "respond in natural spoken prose, no markdown symbols, no bullet points, no code blocks unless the user explicitly asks"; dual output requires separate LLM pass or post-processing strip
Telegram text relay to existing chat	Sending a text message to the Telegram bot and receiving the agent's reply is the core use case; anything less is not a bridge	MEDIUM	Telegraf (Node.js) as bot framework; message forwarded to existing chat API endpoint; response prefixed with agent name
Telegram voice message transcription	Telegram users frequently send voice notes; a bridge that ignores voice messages frustrates mobile users immediately	MEDIUM	Telegram sends voice as OGG/Opus; download → convert (ffmpeg) → POST /transcribe → forward text to agent → reply with text (+ optionally TTS audio back)
Agent identity visible in Telegram replies	When multiple agents can respond, the user must know who is replying	LOW	Simple text prefix: `[Hermes] Your answer here`; consistent format across all messages
Recording state visible in UI	Users must be able to tell when recording is active vs. idle vs. processing	LOW	Three states in mic button: idle (mic icon), recording (red pulsing), processing (spinner); state machine pattern
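
The silence-based auto-submit row can be sketched as a small detector fed with amplitude readings. This is a minimal sketch, not the existing VoiceRecordButton logic: `SilenceDetector`, `rms`, and the 0.02 loudness floor are hypothetical; only the 1.5 s threshold and the countdown requirement come from the table.

```typescript
// Hypothetical helper: decides when to auto-submit based on amplitude readings.
class SilenceDetector {
  private lastVoiceAt: number | null = null;
  private readonly rmsThreshold: number;
  private readonly silenceMs: number;

  constructor(rmsThreshold: number = 0.02, silenceMs: number = 1500) {
    this.rmsThreshold = rmsThreshold; // assumed loudness floor, tune per microphone
    this.silenceMs = silenceMs;       // 1.5 s silence threshold from the table
  }

  /** Feed one RMS reading; returns true once silence has lasted long enough. */
  update(rmsValue: number, nowMs: number): boolean {
    if (rmsValue >= this.rmsThreshold) {
      this.lastVoiceAt = nowMs;
      return false;
    }
    // Never auto-submit before the user has said anything.
    if (this.lastVoiceAt === null) return false;
    return nowMs - this.lastVoiceAt >= this.silenceMs;
  }

  /** Milliseconds left before auto-submit, for the countdown indicator. */
  remainingMs(nowMs: number): number {
    if (this.lastVoiceAt === null) return this.silenceMs;
    return Math.max(0, this.silenceMs - (nowMs - this.lastVoiceAt));
  }
}

/** Root-mean-square amplitude of a block of samples in the range -1..1. */
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < samples.length; i++) sum += samples[i] * samples[i];
  return Math.sqrt(sum / samples.length);
}
```

In the browser, the readings would come from an AnalyserNode or AudioWorklet on each animation frame; the detector itself has no browser dependency, so it can be unit-tested in isolation.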

Differentiators (Competitive Advantage)

Features that make v1.6's voice and Telegram features worth using, beyond baseline functionality.

Feature	Value Proposition	Complexity	Notes
Transport-agnostic voice pipeline	Voice processing works identically for browser input, Telegram voice notes, and future CLI/API callers; no duplication of Whisper/Piper logic	MEDIUM	Abstract to a `VoicePipelineService`: `transcribe(audioBuffer) → text`, `synthesize(text, voice?) → audioBuffer`; HTTP endpoints call the service; Telegram bot calls the same service
Dual output pattern	AI responds with two representations: short spoken-prose version (for TTS/Telegram) and full markdown version (for web chat, copy-paste, code); user sees both where appropriate	HIGH	Prompt engineering: "Provide a SPOKEN response (1-3 sentences, no markdown) and a DETAILED response (full markdown). Format: SPOKEN: … DETAILED: …"; parse and split in middleware; store both in message metadata
Sentence-buffered TTS streaming	Start playing the first sentence while the second is still synthesizing; reduces perceived latency vs. waiting for full response	MEDIUM	Split response on `.!?`; Piper synthesizes sentence 1, audio starts playing; meanwhile sentence 2 begins synthesis; append chunks to audio queue
Voice mode flag preserves context	Messages tagged with `voice_mode: true` in the DB let the UI, Telegram bridge, and future Command Center all render correctly without re-inferring intent	LOW	Add `source` field or `voice_mode` boolean to message metadata; already-existing message schema likely supports metadata/extras column
Telegram as thin relay (not a separate chat product)	The Telegram bot forwards to the existing Nexus chat engine; responses use the full agent intelligence already configured; no separate bot personality to maintain	LOW	Relay pattern: Telegram message → POST /api/workspaces/:id/chat/messages → SSE stream → collect full response → reply to Telegram; agent prefixing is presentation only
Language auto-detection in STT	Whisper natively detects language without configuration; relay this info back to the UI so the user knows what language was detected	LOW	Whisper returns `language` in its JSON output; pass through to transcript response; log in message metadata; no user config needed for common languages
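
The `VoicePipelineService` shape described in the first row might look like the sketch below. Everything beyond the two method names is an assumption: the `Transcript` type, the use of `Uint8Array` for audio, the async signatures, and the `FakeVoicePipeline` stand-in (which does no real Whisper or Piper work) are hypothetical and exist only to show both transports depending on one interface.

```typescript
interface Transcript {
  text: string;
  language?: string; // Whisper reports the detected language
}

// Transport-agnostic contract: callers never touch Whisper or Piper directly.
interface VoicePipelineService {
  transcribe(audio: Uint8Array): Promise<Transcript>;
  synthesize(text: string, voice?: string): Promise<Uint8Array>;
}

// Stand-in implementation for illustration and tests only.
class FakeVoicePipeline implements VoicePipelineService {
  async transcribe(audio: Uint8Array): Promise<Transcript> {
    return { text: `(${audio.length} bytes of audio)`, language: "en" };
  }

  async synthesize(text: string, _voice?: string): Promise<Uint8Array> {
    return new Uint8Array(text.length); // placeholder bytes, not real audio
  }
}

// Hypothetical web handler: returns the transcript plus detected language.
async function handleWebTranscribe(
  svc: VoicePipelineService,
  upload: Uint8Array,
): Promise<Transcript> {
  return svc.transcribe(upload);
}

// Hypothetical Telegram handler: returns only the text to forward to the agent.
async function handleTelegramVoiceNote(
  svc: VoicePipelineService,
  wav: Uint8Array,
): Promise<string> {
  const { text } = await svc.transcribe(wav);
  return text;
}
```

Because both handlers accept the interface rather than a concrete class, a real Whisper/Piper implementation can replace the stand-in without touching either transport.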

Anti-Features (Commonly Requested, Often Problematic)

Feature	Why Requested	Why Problematic	Alternative
Real-time speech-to-speech streaming	Feels like a "next level" voice experience	Requires full-duplex WebSocket audio, interrupt handling, turn-taking logic, VAD on both ends — an entirely different architecture (Pipecat, LiveKit); out of scope for a relay bridge	Sequential pipeline (speak → wait → hear) is sufficient for assistant use cases; real-time is only needed for phone-call-style interaction
Per-agent Telegram bots	"My PM agent should have its own bot handle"	Multiple bots mean multiple bot tokens, multiple webhook registrations, and complex routing when agents hand off to each other; a maintenance nightmare	Single bot with agent name prefix in messages: `[PM] Here is your sprint plan`; PROJECT.md explicitly places this out of scope
Deep Telegram ↔ web chat sync	"I want to see Telegram messages in the web UI"	Real-time bidirectional sync requires a shared event bus (Postgres LISTEN/NOTIFY or Redis pub/sub), session management across transports, and conflict resolution; PROJECT.md explicitly defers this to "Postgres bus" future milestone	Relay is one-way per session: Telegram message → agent → Telegram reply; web chat is a separate session
Wake word detection	"Hey Nexus, start recording"	Requires always-on microphone access, local wakeword model (Porcupine, OpenWakeWord), and careful battery/privacy handling; browser does not allow always-on mic	Mic button tap is sufficient; wake word is a future hardware device concern
Streaming TTS word-by-word	Feels maximally responsive	Browser audio playback of a stream of tiny WAV fragments causes clicks, gaps, and buffering issues; each Piper call has startup overhead; the sentence-buffered approach gives 95% of the benefit	Sentence-buffered playback (buffer on `.!?`); start playing sentence 1 while sentence 2 synthesizes
Inline code execution over Telegram	"I want to run tasks from Telegram"	Security: arbitrary code execution via an unauthenticated chat interface; scope: Telegram bridge is explicitly a thin relay, not a command interface	Support text and voice message relay only; task creation via conversational agent response is sufficient
GSD formatting / rich elements in Telegram	Telegram supports inline keyboards, threaded replies — use them	Telegram's formatting model (inline keyboards, callback queries) requires stateful session tracking; PROJECT.md explicitly places this out of scope	Plain text + Markdown v1 (which Telegram natively renders for bold/italic/code); no inline keyboards in v1.6
Transcription editing before sending	"Let me see the transcript before it goes to the agent"	Adds a confirmation step that breaks the hands-free voice flow; most users trust auto-send after VAD silence detection; optionally show transcript as a message in the UI after the fact	Show the detected transcript in the chat message bubble with a small "mic" icon; no edit step
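
The sentence-buffered alternative depends on splitting a response on `.!?`. A minimal sketch of that split follows, assuming a sentence ends only when the terminator is followed by whitespace or the end of the text, so that "1.5" is not cut in half. The function name is hypothetical, and abbreviations such as "e.g." are not handled.

```typescript
// Split text into sentences on ., ! or ? followed by whitespace or end of text.
function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    const isTerminator = ".!?".indexOf(text.charAt(i)) !== -1;
    const atBoundary = i + 1 === text.length || /\s/.test(text.charAt(i + 1));
    if (isTerminator && atBoundary) {
      const sentence = text.slice(start, i + 1).trim();
      if (sentence.length > 0) sentences.push(sentence);
      start = i + 1;
    }
  }
  // Keep any trailing text that has no terminator.
  const tail = text.slice(start).trim();
  if (tail.length > 0) sentences.push(tail);
  return sentences;
}
```

Each returned sentence can then be synthesized in order and appended to the audio queue, with sentence 1 playing while sentence 2 is still being synthesized.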

Feature Dependencies

Transport-Agnostic VoicePipelineService
    └──wraps──> Existing /transcribe endpoint (Whisper) [already built]
    └──wraps──> Piper TTS binary/WASM [already built in browser; server-side is new]
    └──consumed-by──> Web chat mic button (browser calls server or uses WASM directly)
    └──consumed-by──> Telegram bridge (server-side calls VoicePipelineService)
    └──consumed-by──> Future transports (CLI, API, Command Center)

Voice Mode Flag
    └──set-by──> Web chat (user is in voice mode)
    └──set-by──> Telegram bridge (message arrived as voice note)
    └──consumed-by──> LLM prompt construction (appends no-markdown instruction)
    └──consumed-by──> Dual output pattern (triggers two-response format)
    └──consumed-by──> TTS synthesis (triggers auto-synthesis of response)

Dual Output Pattern
    └──requires──> Voice mode flag (only triggers in voice mode)
    └──requires──> LLM prompt engineering (structured SPOKEN/DETAILED format)
    └──produces──> Short prose (for TTS, Telegram reply)
    └──produces──> Full markdown (for web chat display, copy)

Web Chat Voice UI (silence detection + waveform)
    └──requires──> Existing VoiceRecordButton [already built — enhance, not replace]
    └──requires──> Web Audio API (AnalyserNode for amplitude) [browser built-in]
    └──enhances──> Voice Mode Toggle (waveform only visible when voice mode active)

Web Chat Audio Playback
    └──requires──> TTS synthesis output (WAV/MP3 audio buffer)
    └──requires──> Voice mode flag (auto-play only in full voice mode)
    └──independent──> waveform visualization (different UI component)

Telegram Bridge
    └──requires──> VoicePipelineService (for voice note handling)
    └──requires──> Existing chat API (POST /api/... for message relay)
    └──requires──> ffmpeg (OGG/Opus → WAV conversion for Whisper)
    └──requires──> Telegraf (Node.js bot framework)
    └──independent──> web chat UI changes

Onboarding STT/TTS Detection
    └──requires──> Existing VoiceStep [already built — update, not replace]
    └──requires──> VoicePipelineService availability check
    └──independent──> Telegram bridge

Dependency Notes

VoicePipelineService is the keystone: Build this first. It abstracts Whisper + Piper behind a clean interface. Every other v1.6 feature is a consumer. If this is skipped, the Telegram bridge and web improvements become duplicate, divergent code.
Voice mode flag must be stored on the message: Not just passed in memory. Future Command Center and Telegram both need to know retroactively whether a message was voice-originated.
Dual output is optional on non-voice messages: Text-mode messages do not need the SPOKEN variant. The prompt addendum and response parsing only apply when voice_mode: true.
Telegram bridge has no UI: It's a server-side Node.js process (or Express route). No React changes needed for Telegram.
ffmpeg is a hard dependency for Telegram voice notes: Telegram sends OGG/Opus; Whisper expects WAV/MP3, so voice notes are converted to 16 kHz mono WAV first. ffmpeg must be available on the server. On the Mac Mini this is `brew install ffmpeg`.
Web chat waveform enhances existing VoiceRecordButton: Do not replace it. The existing component handles MediaRecorder and send; add AudioWorklet/AnalyserNode visualization on top.
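
For the waveform note above, the amplitude bars can be derived from the byte time-domain data an AnalyserNode produces, where 128 represents silence. The sketch below is one possible reduction, not the existing component's code; `toAmplitudeBars` is a hypothetical name and peak-per-bucket is an assumed choice (an average would also work).

```typescript
// Reduce AnalyserNode time-domain bytes (0..255, silence = 128) to bar heights 0..1.
function toAmplitudeBars(timeDomain: Uint8Array, barCount: number): number[] {
  if (barCount <= 0) return [];
  const bucketSize = Math.floor(timeDomain.length / barCount);
  if (bucketSize === 0) return []; // not enough samples for the requested bars

  const bars: number[] = [];
  for (let b = 0; b < barCount; b++) {
    let peak = 0;
    for (let i = b * bucketSize; i < (b + 1) * bucketSize; i++) {
      // Distance from the silent midpoint, normalized to 0..1.
      const deviation = Math.abs(timeDomain[i] - 128) / 128;
      if (deviation > peak) peak = deviation;
    }
    bars.push(Math.min(1, peak));
  }
  return bars;
}
```

In the browser, the input would be filled by `analyser.getByteTimeDomainData(buffer)` on each animation frame, and the 30-50 resulting values drawn as canvas or SVG bars.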

MVP Definition

Launch With (v1.6 Milestone)

Minimum viable set to make voice and Telegram genuinely useful, not just technically present.

VoicePipelineService — Transport-agnostic server-side Whisper + Piper abstraction. Why essential: gates all other features; prevents code duplication between web and Telegram.
Voice mode flag + dual output — LLM receives no-markdown instruction; response splits into spoken prose + full markdown. Why essential: spoken markdown sounds broken; this is what makes TTS usable.
Web chat silence detection + auto-submit — Amplitude-based VAD stops recording automatically and submits. Why essential: hands-free voice only works if the user does not have to click "send."
Web chat waveform visualization — Amplitude bars while recording. Why essential: without it, users cannot tell if the mic is picking up audio.
Web chat audio playback with auto-play toggle — Agent voice responses play inline. Why essential: without playback, TTS synthesis has nowhere to go.
Voice mode toggle setting — Three modes: text only / voice input only / full voice (input + output). Why essential: users need to control the modality per session.
Telegram text relay — Text messages in → agent response out, with agent prefix. Why essential: core use case for phone access.
Telegram voice note relay — Voice notes in → transcribe → agent → text reply. Why essential: mobile Telegram users default to voice notes.
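
The three-way voice mode toggle can be modeled as a union type from which UI behavior is derived. This is a sketch under assumptions: the mode names, the `VoiceBehavior` fields, and the choice to send the voice mode flag only in full voice mode are hypothetical; the source fixes only the three modes and the fact that auto-play applies in full voice mode.

```typescript
// Hypothetical setting values for: text only / voice input only / full voice.
type VoiceMode = "text" | "voice-input" | "full-voice";

interface VoiceBehavior {
  showMic: boolean;           // mic button and waveform are visible
  sendVoiceModeFlag: boolean; // ask the LLM for spoken-prose output
  autoPlayReplies: boolean;   // play synthesized replies without a click
}

function behaviorFor(mode: VoiceMode, voiceAutoPlay: boolean): VoiceBehavior {
  return {
    showMic: mode !== "text",
    // Assumption: voice-input users still read the reply, so they keep markdown.
    sendVoiceModeFlag: mode === "full-voice",
    autoPlayReplies: mode === "full-voice" && voiceAutoPlay,
  };
}
```

Deriving all three flags from one setting keeps the components from checking the mode string in several places.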

Add After Validation (v1.6.x)

Telegram TTS reply option — Agent response synthesized and sent back as an OGG voice note. Trigger: user feedback that text replies are too long to read on phone.
Sentence-buffered TTS streaming — Start audio playback before full synthesis completes. Trigger: latency complaints with longer responses.
Voice response history in UI — Chat messages show audio player for past synthesized responses (not just the current one). Trigger: users want to replay previous responses.

Future Consideration (v2+)

Real-time speech-to-speech — Full-duplex conversation; requires Pipecat or LiveKit; entirely different architecture.
Wake word detection — Always-on mic, local wakeword model; hardware device concern.
Deep Telegram ↔ web sync — Bidirectional session mirroring via Postgres bus; deferred per PROJECT.md.
Per-transport voice models — Different Piper voice for Telegram vs. web (e.g., cleaner phone voice vs. natural assistant voice).

Feature Prioritization Matrix

Feature	User Value	Implementation Cost	Priority
VoicePipelineService	HIGH	MEDIUM	P1
Voice mode flag + dual output	HIGH	MEDIUM	P1
Silence detection + auto-submit	HIGH	MEDIUM	P1
Waveform visualization	MEDIUM	LOW	P1
Audio playback + auto-play toggle	HIGH	LOW	P1
Voice mode toggle setting	HIGH	LOW	P1
Telegram text relay	HIGH	MEDIUM	P1
Telegram voice note relay	HIGH	MEDIUM	P1
Telegram TTS reply	MEDIUM	MEDIUM	P2
Sentence-buffered TTS streaming	MEDIUM	MEDIUM	P2
Voice response history	LOW	MEDIUM	P3
Real-time speech-to-speech	HIGH	HIGH	P3 (v2+)

Priority key:

P1: Must have for v1.6 launch
P2: Should have, add in v1.6.x
P3: Nice to have, v2+

Competitor Feature Analysis

Feature	ChatGPT Voice Mode	Telegram + other bots	Nexus v1.6 Approach
STT	Whisper (cloud)	Per-bot (usually cloud)	Whisper local, CPU fallback
TTS	Custom neural (cloud)	gTTS or ElevenLabs	Piper local, CPU-only
Markdown-free voice	Yes (GPT strips markdown)	Usually not (bots send raw markdown)	Dual output: SPOKEN + DETAILED
Silence detection	Yes (VAD, full-duplex)	N/A	Amplitude VAD, 1.5s threshold
Waveform UI	Animated blobs (not literal waveform)	N/A	AnalyserNode amplitude bars
Agent identity in replies	N/A (single assistant)	Custom per bot	Text prefix `[AgentName]`
Telegram voice note support	N/A	Varies widely	OGG→WAV→Whisper→agent
Offline / local operation	No	No	Fully local: Whisper + Piper + Ollama
Transport abstraction	N/A	N/A	VoicePipelineService (web + Telegram share same service)

Voice Pipeline Architecture Notes

Confidence: HIGH for the cascading/sequential pipeline; MEDIUM for dual output prompt engineering reliability.

Sequential Pipeline (chosen architecture for v1.6)

[Browser/Telegram]
      |
      | audio buffer (WAV/OGG)
      v
VoicePipelineService.transcribe()
      |
      | transcript text + language + confidence
      v
LLM (with voice_mode prompt addendum)
      |
      | structured response: SPOKEN: "..." DETAILED: "..."
      v
Response parser → { spoken: string, detailed: string }
      |                        |
      |                        v
      |              Web chat: render detailed (markdown)
      |              Telegram: send spoken as text
      v
VoicePipelineService.synthesize(spoken)
      |
      | WAV audio buffer
      v
Web chat: <audio> element autoplay
Telegram (v1.6.x): sendVoice() as OGG/Opus

Why not real-time speech-to-speech:

Real-time requires full-duplex WebSocket audio, interrupt detection (barge-in), turn-taking state machine, and sub-200ms latency budgets. The sequential pattern targets <3s end-to-end on Apple Silicon M4, which is appropriate for assistant interactions (not phone calls). The complexity delta is enormous; PROJECT.md explicitly defers this.

Telegram Bridge Architecture Notes

Confidence: HIGH — Telegraf is the standard Node.js Telegram framework; patterns are well-established.

Single Bot, Agent Prefix Pattern

Telegram user sends: "What's the status of the Nexus project?"
                           |
                    Telegraf handler
                           |
                    POST /api/workspaces/:id/chat/messages
                    { content: "What's the status...", source: "telegram", voice_mode: false }
                           |
                    SSE stream → collect until [DONE]
                           |
                    bot.telegram.sendMessage(chatId, "[Hermes] The Nexus project is currently...")
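
A minimal sketch of the "collect until [DONE]" step and the agent prefix, assuming each SSE event arrives as a `data: <text chunk>` line and the stream ends with `data: [DONE]`. The actual event payload of the Nexus chat API is not specified in these notes, so the wire format and both function names are hypothetical.

```typescript
// Concatenate the text chunks of an SSE stream up to the [DONE] marker.
// Assumed wire format: "data: <chunk>" lines, terminated by "data: [DONE]".
function collectSseText(lines: string[]): string {
  let text = "";
  for (const line of lines) {
    if (line.indexOf("data:") !== 0) continue; // skip event:, id:, blank lines
    // Per the SSE format, one leading space after the colon is not part of the data.
    const payload = line.slice(5).replace(/^ /, "");
    if (payload === "[DONE]") break;
    text += payload;
  }
  return text;
}

// Presentation-only prefix so the Telegram user can tell which agent replied.
function withAgentPrefix(agentName: string, reply: string): string {
  return `[${agentName}] ${reply}`;
}
```

The relay then sends `withAgentPrefix("Hermes", collectSseText(lines))` back to the chat in a single message once the stream has finished.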

Voice Note Flow

Telegram user sends voice note (OGG/Opus, ~15s)
                           |
                    Telegraf voice handler: bot.telegram.getFileLink(fileId) → download OGG
                           |
                    ffmpeg: OGG → WAV (16kHz mono)
                           |
                    VoicePipelineService.transcribe(wavBuffer)
                           |
                    POST /api/workspaces/:id/chat/messages
                    { content: transcript, source: "telegram", voice_mode: true }
                           |
                    Collect SSE stream → spoken variant of response
                           |
                    bot.telegram.sendMessage(chatId, "[Hermes] " + spokenResponse)
                    // v1.6.x: bot.telegram.sendVoice(chatId, { source: synthesizedOggBuffer })
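
The OGG → WAV (16kHz mono) step in the flow above comes down to one ffmpeg invocation. The sketch below only builds the argument list; the function name and the choice of 16-bit PCM output are assumptions, while the flags themselves are standard ffmpeg options.

```typescript
// Build ffmpeg arguments that convert a Telegram voice note to 16 kHz mono WAV.
function oggToWavArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y",                // overwrite the output file if it already exists
    "-i", inputPath,     // input: OGG/Opus voice note downloaded from Telegram
    "-ar", "16000",      // resample to 16 kHz
    "-ac", "1",          // downmix to mono
    "-c:a", "pcm_s16le", // assumed output encoding: 16-bit PCM
    outputPath,
  ];
}
```

In Node.js, the list can be passed to `execFile("ffmpeg", oggToWavArgs(input, output))` from `node:child_process`, which avoids shell quoting problems with file paths.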

Key implementation decisions:

Polling vs. webhooks: Webhooks require a public HTTPS endpoint. For Mac Mini on home network, long polling is the correct choice. Telegraf supports both; use bot.launch() (polling mode) for v1.6.
Bot token storage: Environment variable TELEGRAM_BOT_TOKEN; added to .env and loaded via existing env config pattern.
Authorized users only: Store allowed Telegram user IDs or usernames in nexus-settings to prevent unauthorized access; a bridge with no auth is a security hole.
Conversation context: Each Telegram chat ID maps to a Nexus workspace session; maintain a telegramChatId → workspaceId + conversationId mapping in a lightweight in-memory store or SQLite table.
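
The authorization and conversation-context decisions can be sketched as one in-memory store. This is illustrative only: `TelegramSessionStore`, `ChatBinding`, and the numeric-ID allowlist are hypothetical, and an in-memory map loses its bindings on restart, which is where the SQLite option would come in.

```typescript
interface ChatBinding {
  workspaceId: string;
  conversationId: string;
}

// Hypothetical store: allowlist check plus telegramChatId → binding lookup.
class TelegramSessionStore {
  private readonly allowedUserIds: Set<number>;
  private readonly bindings: Map<number, ChatBinding> = new Map();

  constructor(allowedUserIds: number[]) {
    this.allowedUserIds = new Set(allowedUserIds);
  }

  /** Only users listed in settings may talk to the bridge. */
  isAuthorized(userId: number): boolean {
    return this.allowedUserIds.has(userId);
  }

  /** Remember which workspace and conversation a Telegram chat maps to. */
  bind(chatId: number, binding: ChatBinding): void {
    this.bindings.set(chatId, binding);
  }

  /** Returns undefined for a chat that has no conversation yet. */
  resolve(chatId: number): ChatBinding | undefined {
    return this.bindings.get(chatId);
  }
}
```

A handler would call `isAuthorized` before doing any work and drop messages from unknown users, then use `resolve` to decide whether to continue an existing conversation or start a new one.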

Voice Mode Response Formatting Notes

Confidence: MEDIUM — dual output prompt pattern is used in production systems but prompt reliability varies by model; post-processing strip is more reliable.

Two approaches; use Approach A first and fall back to Approach B when the model ignores the format:

Approach A: Prompt-based dual output (preferred)

Append to the system prompt when voice_mode: true:

When responding, provide two versions:
SPOKEN: [1-3 sentences in natural spoken prose, no markdown, no symbols, no lists]
DETAILED: [Full response with markdown formatting, code blocks, bullet points as needed]

Parse response: split on SPOKEN: and DETAILED: markers.
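
A minimal sketch of that parse step, assuming the model emits the SPOKEN: marker before the DETAILED: marker. The function and type names are hypothetical; returning `null` signals that the format was not followed, so the caller can fall back to post-processing.

```typescript
interface DualOutput {
  spoken: string;
  detailed: string;
}

// Split a model response on the SPOKEN: and DETAILED: markers.
// Returns null when either marker or either section is missing.
function parseDualOutput(response: string): DualOutput | null {
  const match = response.match(/SPOKEN:\s*([\s\S]*?)\s*DETAILED:\s*([\s\S]*)/);
  if (!match) return null;
  const spoken = match[1].trim();
  const detailed = match[2].trim();
  if (spoken.length === 0 || detailed.length === 0) return null;
  return { spoken, detailed };
}
```

Both halves can then be stored in message metadata: `spoken` goes to TTS or the Telegram reply, `detailed` to the web chat.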

Approach B: Post-processing strip (fallback)

If the model doesn't follow the dual output format, post-process the full response:

Strip **bold** → "bold"
Strip `code` → "code"
Strip # headers → remove # prefix
Strip - bullet points → convert to sentences or strip
Strip code blocks → summarize as "[code example]" or remove entirely

Use the result as the spoken variant. The full original markdown response is the detailed variant.
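
The strip rules above can be sketched with a handful of regular expressions. This is a rough fallback, not a markdown parser: the function name is hypothetical, code blocks are replaced with the "[code example]" placeholder, bullet markers are simply dropped, and constructs such as links, tables, and nested emphasis are not handled.

```typescript
// Three backticks, built at runtime so this example stays readable inside a fence.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "[\\s\\S]*?" + FENCE, "g");

// Remove the markdown symbols that would otherwise be read aloud by TTS.
function stripMarkdown(markdown: string): string {
  return markdown
    .replace(FENCED_BLOCK, "[code example]") // fenced code blocks
    .replace(/`([^`]+)`/g, "$1")             // inline code
    .replace(/\*\*([^*]+)\*\*/g, "$1")       // bold
    .replace(/^#{1,6}[ \t]+/gm, "")          // header prefixes
    .replace(/^[ \t]*[-*][ \t]+/gm, "")      // bullet markers
    .replace(/\n{2,}/g, "\n")                // collapse blank lines
    .trim();
}
```

Fenced blocks are handled first so that backticks, asterisks, and hashes inside code never reach the later rules.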

Reliable rule: Never read markdown symbols aloud. Either approach prevents this; dual output is preferred because it lets the LLM choose better phrasing for spoken delivery (short, natural sentences vs. information-dense bullets).

Sources

Feature research for: Nexus v1.6 Voice Pipeline + Minimal Telegram Bridge
Researched: 2026-04-03
