Architecture Research
Domain: Voice Pipeline + Minimal Telegram Bridge (v1.6), integration with existing Nexus/Paperclip monorepo
Researched: 2026-04-03
Confidence: HIGH (based on direct codebase inspection and verified current documentation)
System Overview
v1.6 adds two parallel capability tracks onto the existing monorepo: a transport-agnostic voice pipeline (Whisper STT + Piper TTS) and a disposable Telegram bridge that reuses those pipeline primitives for phone access. The architecture constraint is that no voice or chat logic is Telegram-specific — Telegram is an interchangeable transport layer that calls the same server services as the web UI.
+-----------------------------------------------------------------------------------+
| UI Layer (React/Vite) |
| |
| +-------------------------------------------------------------------------+ |
| | ChatPanel / PersonalAssistant (MODIFIED) | |
| | +---------------------+ +--------------------+ +------------------+ | |
| | | VoiceMicButton (NEW)| | WaveformDisplay | | TtsButton (v1.5) | | |
| | | silence detection | | (NEW) animated bars| | + auto-play prop | | |
| | | auto-send on silence| +--------------------+ +------------------+ | |
| | +---------------------+ | |
| | +-------------------------------------------------------------------+ | |
| | | ChatMessage (MODIFIED) — voice_mode badge, dual output toggle | | |
| | +-------------------------------------------------------------------+ | |
| | +-------------------------------------------------------------------+ | |
| | | VoiceModeToggle (NEW) — text only / voice input / full voice | | |
| | +-------------------------------------------------------------------+ | |
| +-------------------------------------------------------------------------+ |
+-----------------------------------------------------------------------------------+
| HTTP + SSE
+-----------------------------------------------------------------------------------+
| Server Layer (Express) |
| |
| +------------------------------------+ +------------------------------------+ |
| | voice.ts (NEW route) | | telegram.ts (NEW route/service) | |
| | POST /transcribe (MOVED) | | grammY long-poll process | |
| | POST /synthesize (NEW) | | text + voice relay | |
| +------------------------------------+ +------------------------------------+ |
| | | |
| +-----------------v--------------------------------------------v--------------+ |
| | voicePipelineService (NEW — core) | |
| | transcribe(audioBuffer, format) -> string | |
| | synthesize(text, voiceId?) -> Buffer (WAV) | |
| | formatForVoice(text) -> { voice: string, full: string } | |
| +------------------------------------------------------------------------------+ |
| | |
| +-----------------v--------------------------------------------------------------+|
| | chatService / nexusSettingsService (EXISTING) ||
| | conversations . messages . stream SSE . memory . voiceEnabled ||
| +--------------------------------------------------------------------------------+|
| | |
| +-----------------v--------------------------------------------------------------+|
| | External Processes (spawned via child_process.spawn / execFile) ||
| | whisper-cpp / whisper (STT) piper (TTS) ||
| +--------------------------------------------------------------------------------+|
+-----------------------------------------------------------------------------------+
^
| Telegram Bot API (HTTPS long-poll)
+--------+------------------------------------------------------------------------+
| Telegram (external service) |
| User sends text -> bot relays to chatService -> SSE reply -> bot sends back |
| User sends voice -> bot downloads OGG -> voicePipelineService.transcribe() |
| -> chatService -> reply -> voicePipelineService.synthesize() |
| -> bot sends OGG audio reply |
+----------------------------------------------------------------------------------+
Integration Points: New vs. Existing
What Stays Unchanged
| Component | Location | Status |
|---|---|---|
| `chatService` | `server/src/services/chat.ts` | No changes; voice pipeline uses it as-is |
| `nexusSettingsService` | `server/src/services/nexus-settings.ts` | Extend schema only (add `voiceMode`, `telegramToken`) |
| `chatFileRoutes` | `server/src/routes/chat-files.ts` | `/transcribe` moves out; file upload stays |
| `usePiperTts` | `ui/src/hooks/usePiperTts.ts` | No changes; `TtsButton` continues using browser WASM |
| `TtsButton` | `ui/src/components/TtsButton.tsx` | Add auto-play prop only |
| SSE stream endpoint | `server/src/routes/chat.ts` | No changes; Telegram bridge calls services directly |
| DB schema | `packages/db` | No changes; voice is file/process, not a DB column |
What Changes (MODIFIED)
| Component | Location | Change |
|---|---|---|
| `VoiceRecordButton` | `ui/src/components/VoiceRecordButton.tsx` | Add silence detection, waveform data emission, auto-send on silence |
| `ChatInput` | `ui/src/components/ChatInput.tsx` | Wire new `VoiceMicButton`, add voice mode prop |
| `ChatMessage` | `ui/src/components/ChatMessage.tsx` | Show voice_mode badge, show dual output collapse/expand |
| `nexusSettingsSchema` | `server/src/services/nexus-settings.ts` | Add `voiceMode` enum and `telegramToken` optional string |
| `app.ts` | `server/src/app.ts` | Register `voiceRoutes`, `telegramRoutes` |
| `createMessageSchema` | `packages/shared/src/validators/chat.ts` | Add `voiceMode: z.boolean().optional()` flag on messages |
| `ChatMessage` type | `packages/shared/src/types/chat.ts` | Add optional `voiceMode: boolean` field |
| `chat-files.ts` | `server/src/routes/chat-files.ts` | Remove `/transcribe` handler (moved to `voice.ts`) |
What Is New (NEW)
| Component | Location | Purpose |
|---|---|---|
| `voicePipelineService` | `server/src/services/voice-pipeline.ts` | Transport-agnostic STT/TTS core, used by web routes and the Telegram bridge |
| `voice.ts` (route) | `server/src/routes/voice.ts` | `POST /api/transcribe`, `POST /api/synthesize`: thin HTTP wrappers |
| `telegram.ts` (service) | `server/src/services/telegram.ts` | grammY bot init, long-poll loop, message relay, voice relay |
| `telegram.ts` (route) | `server/src/routes/telegram.ts` | `GET /api/telegram/status`, `POST /api/telegram/token` management endpoints |
| `VoiceMicButton` | `ui/src/components/VoiceMicButton.tsx` | Enhanced mic button with silence detection and waveform display |
| `WaveformDisplay` | `ui/src/components/WaveformDisplay.tsx` | Animated audio waveform bars using `AnalyserNode` |
| `VoiceModeToggle` | `ui/src/components/VoiceModeToggle.tsx` | Three-state toggle: text only / voice input / full voice |
| `useVoiceMode` | `ui/src/hooks/useVoiceMode.ts` | Reads/writes voice mode setting via `/api/nexus-settings` |
| `useSilenceDetection` | `ui/src/hooks/useSilenceDetection.ts` | Web Audio API `AnalyserNode` watching for 1.5s silence threshold |
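As a rough illustration of the `WaveformDisplay` row, the bars can be derived by averaging the analyser's byte data into a fixed number of buckets. The helper name `toBarHeights` and the 0 to 1 normalization are assumptions made for this sketch, not code from the repository.

```typescript
// Hypothetical helper: reduce AnalyserNode byte data (0..255 per bin)
// to `barCount` normalized bar heights in the range 0..1.
function toBarHeights(data: Uint8Array, barCount: number): number[] {
  if (barCount <= 0 || data.length === 0) return [];
  const bucketSize = Math.max(1, Math.floor(data.length / barCount));
  const bars: number[] = [];
  for (let i = 0; i < barCount; i++) {
    const start = i * bucketSize;
    const end = Math.min(start + bucketSize, data.length);
    if (start >= data.length) {
      bars.push(0); // more bars requested than bins available
      continue;
    }
    let sum = 0;
    for (let j = start; j < end; j++) sum += data[j];
    bars.push(sum / (end - start) / 255);
  }
  return bars;
}
```

In a browser component, `data` would typically be refilled each animation frame with `analyser.getByteFrequencyData(data)` before calling the helper.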
Component Boundaries
voicePipelineService (Core)
This is the key abstraction for v1.6. Both the web HTTP route and the Telegram bridge call this service — neither knows about the other.
| Method | Input | Output | Implementation |
|---|---|---|---|
| `transcribe(buffer, format)` | `Buffer`, `"webm"` or `"ogg"` | `Promise<string>` | Writes temp file, uses `execFile` (not `exec`) to spawn whisper-cpp or whisper CLI, reads stdout, cleans up |
| `synthesize(text, voiceId?)` | `string`, optional `voiceId` | `Promise<Buffer>` | Spawns piper CLI via `spawn`, pipes text to stdin, collects WAV stdout |
| `formatForVoice(text)` | `string` | `{ voice: string; full: string }` | Strips code blocks and markdown for voice; returns both variants |
The transcribe method extends the existing /transcribe implementation from chat-files.ts by adding an ogg format path alongside the existing webm path. The same cascade (whisper-cpp first, openai-whisper fallback) is preserved.
Why a dedicated service vs. inline in routes: The Telegram bridge cannot call the web route (circular HTTP call within the same process). Both transports need the same logic. Extracting to a service eliminates duplication and makes both implementations testable in isolation.
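A minimal sketch of what `formatForVoice` could look like, assuming a regex-based heuristic is good enough for spoken output. The specific stripping rules below are illustrative assumptions; the document only states that code blocks and markdown are removed for the voice variant and that both variants are returned.

```typescript
// Built at runtime so this sketch contains no literal triple backtick.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "[\\s\\S]*?" + FENCE, "g");

function formatForVoice(text: string): { voice: string; full: string } {
  const voice = text
    .replace(FENCED_BLOCK, " ") // drop fenced code blocks entirely
    .replace(/`([^`]*)`/g, "$1") // unwrap inline code
    .replace(/!?\[([^\]]*)\]\([^)]*\)/g, "$1") // links and images -> label text
    .replace(/^[ \t]{0,3}(#{1,6}|>|[-*+]|\d+\.)[ \t]+/gm, "") // heading, quote, list markers
    .replace(/\*\*|\*|~~/g, "") // emphasis markers (underscores kept to protect snake_case)
    .replace(/\s+/g, " ") // collapse whitespace for TTS
    .trim();
  return { voice, full: text };
}
```

The `full` variant is returned untouched so the chat UI and the Telegram text reply can still show code and formatting.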
telegram service
A thin relay, not a feature-rich bot. It:
- Holds a single grammY `Bot` instance, initialized when `telegramToken` is set in nexus-settings
- Routes text messages to `chatService.addMessage()`, then collects the AI response via `puterProxyService.chatStream()`
- Routes voice messages: downloads the OGG file, calls `voicePipelineService.transcribe()`, then follows the same text path
- If `voiceMode === "full_voice"`: calls `voicePipelineService.synthesize()` and sends audio back via `ctx.replyWithAudio()`
- Prefixes the agent name on replies: `[Agent Name]: message text`
No per-user conversation tracking. All Telegram messages go to a single conversation (or create one on first use) associated with the workspace. This is the intentional "thin bridge" design — full sync is out of scope per PROJECT.md.
Voice Route vs. Chat Files Route
The existing /transcribe endpoint lives inside chatFileRoutes in chat-files.ts. For v1.6, the endpoint moves to a dedicated voice.ts route. This is a path-preserving refactor: the endpoint behavior is unchanged, but the code now lives in a Nexus-specific file rather than inside a mostly-upstream file.
Moving the handler reduces merge conflict surface on future upstream rebases of chat-files.ts.
Recommended Project Structure
server/src/
app.ts # MODIFY: register voiceRoutes, telegramRoutes
routes/
chat-files.ts # MODIFY: remove /transcribe handler (moved to voice.ts)
voice.ts # NEW: POST /transcribe, POST /synthesize
nexus-settings.ts # MODIFY: expose voiceMode + telegramToken fields
telegram.ts # NEW: GET /telegram/status, POST /telegram/token
services/
voice-pipeline.ts # NEW: transcribe(), synthesize(), formatForVoice()
telegram.ts # NEW: grammY bot lifecycle + relay logic
nexus-settings.ts # MODIFY: add voiceMode + telegramToken to schema
ui/src/
components/
VoiceMicButton.tsx # NEW: replaces VoiceRecordButton in ChatInput
WaveformDisplay.tsx # NEW: animated bars from AnalyserNode data
VoiceModeToggle.tsx # NEW: 3-state toggle (text / voice-in / full-voice)
VoiceRecordButton.tsx # KEEP as-is (still used in file upload contexts)
TtsButton.tsx # MODIFY: add autoPlay prop
ChatInput.tsx # MODIFY: add VoiceModeToggle, swap in VoiceMicButton
ChatMessage.tsx # MODIFY: voice_mode badge + dual output expand
hooks/
useVoiceMode.ts # NEW: reads/writes voiceMode setting
useSilenceDetection.ts # NEW: AnalyserNode silence threshold
usePiperTts.ts # KEEP as-is (browser-side TTS unchanged)
packages/shared/src/
validators/chat.ts # MODIFY: add voiceMode flag to createMessageSchema
types/chat.ts # MODIFY: add voiceMode field to ChatMessage
Architectural Patterns
Pattern 1: Transport-Agnostic Voice Service
What: A server service (voicePipelineService) owns STT and TTS logic. HTTP routes and Telegram relay both call the service — neither implements STT/TTS directly.
When to use: Any time two transports (web + bot) need the same capability.
Trade-offs: Adds one indirection layer. Worth it: eliminates duplication, makes each transport testable independently.
Shape:
// server/src/services/voice-pipeline.ts
// Shape only: method bodies are omitted.
export interface VoicePipelineService {
  // Uses execFile (not exec): prevents shell injection, consistent with codebase pattern
  transcribe(buffer: Buffer, format: "webm" | "ogg"): Promise<string>;
  synthesize(text: string, voiceId?: string): Promise<Buffer>;
  formatForVoice(text: string): { voice: string; full: string };
}
export declare function voicePipelineService(): VoicePipelineService;
The existing /transcribe handler in chat-files.ts already uses promisify(execFile) — this pattern is the right model. The service wraps it with format selection (webm vs ogg) and the same whisper-cpp → openai-whisper cascade.
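The cascade itself can be sketched independently of any particular CLI. In the sketch below the engines are injected as plain async functions, so no whisper flags are assumed; `SttEngine` and `transcribeWithCascade` are hypothetical names, and the real handler in `chat-files.ts` may be structured differently.

```typescript
// Hypothetical engine wrapper: `run` would call execFile on the real CLI.
type SttEngine = {
  name: string;
  run: (audioPath: string) => Promise<string>; // resolves to transcript text
};

// Try each engine in order; the first one that does not throw wins.
async function transcribeWithCascade(
  engines: SttEngine[],
  audioPath: string,
): Promise<{ engine: string; text: string }> {
  const failures: string[] = [];
  for (const engine of engines) {
    try {
      const text = (await engine.run(audioPath)).trim();
      return { engine: engine.name, text };
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(engine.name + ": " + reason);
    }
  }
  throw new Error("all STT engines failed (" + failures.join("; ") + ")");
}
```

Passing whisper-cpp first and openai-whisper second reproduces the order described above; a missing binary surfaces as a rejected `run` and simply moves the cascade along.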
Pattern 2: Thin Telegram Relay
What: The Telegram bot is a relay, not a first-class UI. It translates Telegram message events into the same chatService calls the web UI makes, then sends the response back via Telegram.
When to use: Building a disposable bridge that will be replaced by a richer implementation later.
Trade-offs: No rich UI (no inline keyboards, no threading). Acceptable: PROJECT.md explicitly calls out "thin bridge only" and "Telegram threads/topics/inline keyboards" are out of scope.
Shape:
// server/src/services/telegram.ts
import type { Bot } from "grammy";
// Shape only: bodies omitted. Db is the repo's database handle type (import not shown).
export interface TelegramService {
  start(token: string): void; // idempotent, long-poll
  stop(): void;
  isRunning(): boolean;
}
// Internally holds a single instance: let bot: Bot | null = null;
export declare function telegramService(db: Db): TelegramService;
The bot calls chatService(db) and puterProxyService(db) directly — no HTTP round-trip to the same server.
Pattern 3: Voice Mode Flag on Messages
What: Each message carries an optional voiceMode: boolean flag. When true, the server formats the response for voice (dual output: voice + full), and the client auto-plays TTS and shows the full text in a collapsible block.
When to use: Differentiating voice-initiated messages from text messages within the same conversation.
Trade-offs: Adds a field to createMessageSchema and the ChatMessage type. The field is optional and defaults to false, so existing messages and the upstream schema are not broken.
Schema change:
// packages/shared/src/validators/chat.ts (additive only)
import { z } from "zod";

export const createMessageSchema = z.object({
  role: z.enum(["user", "assistant", "system"]),
  content: z.string().min(1).max(100_000),
  agentId: z.string().uuid().optional(),
  messageType: z.string().optional(),
  voiceMode: z.boolean().optional(), // NEW in v1.6
});
Pattern 4: Direct Service Calls in Telegram Bridge
What: The Telegram bot does not call the Express HTTP API to get AI responses. It calls chatService(db) and puterProxyService(db) as regular TypeScript function calls within the same server process.
When to use: Any time a server-side integration needs the same AI response capability as the web UI without an HTTP round-trip.
Trade-offs: Telegram handler and web handler share the same in-process service instances. If chatService has connection pooling issues, both paths are affected. This is acceptable — single-user deployment, same DB connection pool.
Why not HTTP: A fetch("http://localhost:PORT/api/...") call from within the same server requires auth token injection, port discovery, and creates circular request chains that are hard to test and fragile in development.
Pattern 5: grammY Long-Poll for Single-User Local Deployment
What: Use grammY bot.start() (long polling) rather than webhooks. The bot polls Telegram for new messages continuously while the server is running.
When to use: Local single-user deployments where a public HTTPS endpoint is not available. No reverse proxy needed, no SSL cert, no domain.
Trade-offs: Long polling is slightly less efficient than webhooks (Telegram must respond to each poll request) but functionally equivalent for <5,000 messages/hour. Fine for personal use.
Lifecycle:
- Start: `nexusSettingsService().get()` finds `telegramToken` set → `telegramService(db).start(token)`
- Stop: `server.close()` → `telegramService(db).stop()`
- Runtime toggle: `POST /api/telegram/token` updates nexus-settings and calls start/stop
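The runtime toggle can be sketched as a small reconciliation step between the stored token and the running bot. `applyTelegramToken` and `BotLifecycle` are hypothetical names; the interface simply mirrors the `{ start, stop, isRunning }` shape from Pattern 2, and the restart-on-change behavior is an assumption.

```typescript
// Mirrors the { start, stop, isRunning } shape shown in Pattern 2.
interface BotLifecycle {
  start(token: string): void;
  stop(): void;
  isRunning(): boolean;
}

// Decide what to do when the stored token changes (hypothetical helper).
function applyTelegramToken(
  service: BotLifecycle,
  previousToken: string | undefined,
  nextToken: string | undefined,
): "started" | "stopped" | "restarted" | "unchanged" {
  if (!nextToken) {
    if (!service.isRunning()) return "unchanged";
    service.stop(); // token cleared: shut the bridge down
    return "stopped";
  }
  if (service.isRunning() && previousToken === nextToken) return "unchanged";
  const wasRunning = service.isRunning();
  if (wasRunning) service.stop(); // token replaced: restart with the new one
  service.start(nextToken);
  return wasRunning ? "restarted" : "started";
}
```

A `POST /api/telegram/token` handler could call this after persisting the new value, and server boot could call it with `previousToken` set to `undefined`.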
Data Flow
Web Voice Input Flow
User holds mic button
|
v
VoiceMicButton: MediaRecorder + AnalyserNode
|
v (silence detected after 1.5s or stop pressed)
POST /api/transcribe {audio: webm blob}
|
v
voice.ts route -> voicePipelineService.transcribe(buffer, "webm")
|
v (whisper-cpp or openai-whisper CLI via execFile)
{ text: "transcribed text" }
|
v
ChatInput fills textarea -> user sends (message tagged voiceMode: true)
|
v
POST /conversations/:id/stream -> chatService + puterProxyService
|
v (SSE tokens arrive)
ChatMessage with voice_mode badge + dual output (voice text + full text collapsible)
|
v
TtsButton auto-plays (browser-side piper-tts-web WASM — unchanged from v1.5)
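The silence step in this flow can be sketched as a small state machine that is fed one loudness reading per animation frame. The RMS threshold of 0.01 and the rule that silence only counts after speech has been heard are assumptions of this sketch; only the 1.5 s window comes from the document.

```typescript
// Root-mean-square level of one frame of time-domain samples (-1..1).
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) sumSquares += samples[i] * samples[i];
  return Math.sqrt(sumSquares / samples.length);
}

// Returns an update function; it yields true once the level has stayed
// below `thresholdRms` for `silenceMs` after speech was first detected.
function createSilenceDetector(thresholdRms = 0.01, silenceMs = 1500) {
  let heardSpeech = false;
  let quietSince: number | null = null;
  return function update(level: number, nowMs: number): boolean {
    if (level >= thresholdRms) {
      heardSpeech = true;
      quietSince = null; // speech resets the silence window
      return false;
    }
    if (!heardSpeech) return false; // do not auto-send before the user speaks
    if (quietSince === null) quietSince = nowMs;
    return nowMs - quietSince >= silenceMs;
  };
}
```

In the hook, each frame would read samples with `analyser.getFloatTimeDomainData(buffer)` and call `update(rms(buffer), performance.now())`, stopping the recorder when it returns true.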
Server-Side TTS Flow (POST /synthesize)
POST /api/synthesize { text, voiceId? }
|
v
voice.ts route -> voicePipelineService.synthesize(text)
|
v (piper CLI via spawn: text -> stdin, WAV bytes <- stdout)
Response: Content-Type audio/wav, Buffer body
|
v
Client: new Audio(URL.createObjectURL(blob)).play()
Note: Server-side /synthesize is new in v1.6. Its primary consumer is the Telegram bridge (which cannot use browser WASM). Web chat continues using browser-side usePiperTts WASM (v1.5 unchanged). The route is available for headless/server scenarios going forward.
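The stdin-to-stdout step can be sketched with a generic helper around `child_process.spawn`. The helper name `pipeThroughCli` and the 8 second default timeout are assumptions; the command is a parameter, so nothing here depends on piper being installed.

```typescript
import { spawn } from "node:child_process";

// Write `input` to the child's stdin and collect everything it prints to stdout.
function pipeThroughCli(
  command: string,
  args: string[],
  input: string,
  timeoutMs = 8_000,
): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: ["pipe", "pipe", "ignore"] });
    const chunks: Buffer[] = [];
    const timer = setTimeout(() => {
      child.kill(); // do not leave a hung process behind
      reject(new Error(command + " timed out after " + timeoutMs + " ms"));
    }, timeoutMs);
    child.stdout.on("data", (chunk: Buffer) => chunks.push(chunk));
    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err); // e.g. ENOENT when the binary is not installed
    });
    child.on("close", (code) => {
      clearTimeout(timer);
      if (code === 0) resolve(Buffer.concat(chunks));
      else reject(new Error(command + " exited with code " + code));
    });
    child.stdin.on("error", () => undefined); // ignore EPIPE if the child exits early
    child.stdin.end(input);
  });
}
```

Using the piper invocation quoted in Sources, `synthesize` could then be roughly `pipeThroughCli("piper", ["--model", modelPath, "-f", "-"], text)`, where `modelPath` is a placeholder for whichever voice model is configured.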
Telegram Text Message Flow
Telegram user sends text
|
v
grammY bot.on("message:text") handler
|
v
telegramService: resolveOrCreateConversation(db)
|
v
chatService(db).addMessage(conversationId, { role: "user", content: text })
|
v
telegramService: collect full response via puterProxyService(db).chatStream()
|
v (if voiceMode !== "full_voice")
ctx.reply("[AgentName]: full_response_text")
| (if voiceMode === "full_voice")
v
voicePipelineService.formatForVoice(response) -> { voice, full }
ctx.reply("[AgentName]: " + full) -- text message with full details
|
v
voicePipelineService.synthesize(voice) -> WAV Buffer
ctx.replyWithAudio(new InputFile(wavBuffer, "reply.ogg")) -- buffer is WAV; transcode to OGG first if the reply must be OGG
Telegram Voice Message Flow
Telegram user sends voice note (OGG Opus format)
|
v
grammY bot.on("message:voice") -> ctx.getFile() -> download Buffer
|
v
voicePipelineService.transcribe(buffer, "ogg") -> whisper CLI -> text
|
v
(same path as Telegram text message above)
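The download step can be sketched with the Bot API file endpoint, which serves files at `https://api.telegram.org/file/bot<token>/<file_path>`. The helper names are hypothetical, and `filePath` is assumed to come from the `file_path` field of the object returned by `ctx.getFile()`, which may be absent and should be checked by the caller.

```typescript
// Build the Bot API download URL for a file_path returned by getFile.
function telegramFileUrl(botToken: string, filePath: string): string {
  return "https://api.telegram.org/file/bot" + botToken + "/" + filePath;
}

// Fetch the voice note bytes (requires a runtime with global fetch, e.g. Node 18+).
async function downloadTelegramFile(botToken: string, filePath: string): Promise<Buffer> {
  const res = await fetch(telegramFileUrl(botToken, filePath));
  if (!res.ok) throw new Error("Telegram file download failed: HTTP " + res.status);
  return Buffer.from(await res.arrayBuffer());
}
```

The resulting buffer is what the handler would pass to `voicePipelineService.transcribe(buffer, "ogg")`.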
nexus-settings Schema Evolution
v1.5: { mode, voiceEnabled }
v1.6: { mode, voiceEnabled, voiceMode, telegramToken }
voiceMode: "text" | "voice_input" | "full_voice" (default: "text")
telegramToken: string | undefined (set by user via UI or POST /telegram/token)
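One way to keep v1.5 settings files readable is to fill the new field with its default at load time, and to derive behavior flags from `voiceMode` in a single place. This is a sketch under assumptions: `withV16Defaults`, `behaviorFor`, the flag names, and the `mode?: string` type are illustrative, and the flag values follow the three states this section describes.

```typescript
type VoiceMode = "text" | "voice_input" | "full_voice";

interface NexusSettings {
  mode?: string; // existing v1.5 field; its real type is not shown in this document
  voiceEnabled?: boolean;
  voiceMode: VoiceMode;
  telegramToken?: string;
}

// Accept a v1.5 file (no voiceMode) and fall back to the documented default "text".
function withV16Defaults(raw: Partial<NexusSettings>): NexusSettings {
  const voiceMode: VoiceMode =
    raw.voiceMode === "voice_input" || raw.voiceMode === "full_voice" ? raw.voiceMode : "text";
  return { ...raw, voiceMode };
}

// Hypothetical flags derived from the three voiceMode states.
function behaviorFor(mode: VoiceMode) {
  return {
    micAutoSend: mode !== "text",
    ttsAutoPlay: mode === "full_voice",
    telegramVoiceIn: mode !== "text",
    telegramVoiceOut: mode === "full_voice",
  };
}
```

Centralizing the mapping means the web UI and the Telegram bridge read the same flags instead of each comparing mode strings.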
voiceMode is a workspace-level setting (not per-agent). The three states map to:
- "text": mic button transcribes to text input, TTS manual-only, Telegram text-only
- "voice_input": mic transcribes and auto-sends, TTS manual-only, Telegram voice-in + text-out
- "full_voice": mic auto-sends, TTS auto-plays on every response, Telegram voice-in + voice-out
Scaling Considerations
This system targets a single user on Mac Mini M4 throughout its lifetime. Scaling is not a concern. The architecture is optimized for simplicity and upstream merge compatibility.
| Concern | At 1 user (target) | Notes |
|---|---|---|
| STT latency | whisper-cpp base.en on M4: ~1-3s | Acceptable; shows transcribing spinner |
| TTS latency | piper CLI on M4: ~0.3-1s for short text | <3s target met |
| Telegram poll | grammY `bot.start()`, 1 process | Adequate for <5,000 msgs/hour |
| Memory overhead | ~10-20MB for polling loop | Acceptable on 16GB+ M4 |
| Piper model | First server-side synthesize: cold start | Piper loads model into memory; subsequent calls fast |
Anti-Patterns
Anti-Pattern 1: Telegram-Specific Voice Logic
What people do: Implement OGG-to-text and text-to-OGG directly inside the Telegram bot handler.
Why it's wrong: Creates two separate STT/TTS code paths that diverge over time. Voice bugs must be fixed in two places. Untestable in isolation.
Do this instead: All voice processing goes through voicePipelineService. The Telegram handler calls transcribe(buf, "ogg") — the service handles format differences. The web route calls transcribe(buf, "webm") — same service, different format argument.
Anti-Pattern 2: Circular HTTP Call for Telegram AI Response
What people do: Telegram bot handler calls fetch("http://localhost:PORT/api/conversations/:id/stream") to get AI responses from within the same server process.
Why it's wrong: Requires auth token injection. Fragile (port discovery). Extra TCP round-trip. Fails in test environments where the HTTP server may not be running.
Do this instead: telegramService imports chatService(db) and puterProxyService(db) directly. Collect tokens from the async generator into a string, then send to Telegram as a single message.
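Collecting the stream is a few lines once the generator is in hand. This sketch assumes `puterProxyService(db).chatStream()` yields plain string tokens, which the document does not spell out; `collectStream` and `formatTelegramReply` are hypothetical helper names.

```typescript
// Drain an async token stream into one string.
async function collectStream(tokens: AsyncIterable<string>): Promise<string> {
  let text = "";
  for await (const token of tokens) text += token;
  return text;
}

// Apply the "[Agent Name]: message text" prefix used by the bridge.
function formatTelegramReply(agentName: string, text: string): string {
  return "[" + agentName + "]: " + text;
}
```

The handler would then send `formatTelegramReply(name, await collectStream(stream))` with a single `ctx.reply()` call.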
Anti-Pattern 3: Blocking grammY on Slow CLI Processes
What people do: await synthesize() inside a bot handler with no timeout, assuming piper is always available and fast.
Why it's wrong: If the piper binary is not installed or hangs, the grammY update queue stalls. The same update gets retried indefinitely.
Do this instead: Wrap CLI calls in a Promise.race([piperCall, timeout(8_000)]). If piper times out or is not installed, fall back to text-only reply and log the failure. Bot degrades gracefully to text mode.
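A minimal sketch of the timeout-and-fallback wrapper, assuming the synthesize call is passed in as a function. `withTimeout` and `synthesizeOrNull` are hypothetical names. Note that `Promise.race` only stops waiting; it does not terminate the child process, so the CLI wrapper should also kill the process on its own timeout.

```typescript
// Reject if `work` does not settle within `ms`; always clear the timer.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("timed out after " + ms + " ms")), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Returns audio bytes, or null when TTS fails or is too slow (caller replies with text only).
async function synthesizeOrNull(
  synthesize: () => Promise<Uint8Array>,
  ms = 8_000,
): Promise<Uint8Array | null> {
  try {
    return await withTimeout(synthesize(), ms);
  } catch (err) {
    console.warn("voice reply skipped, falling back to text:", err);
    return null;
  }
}
```

With this shape the bot handler always finishes: a `null` result means the text reply already sent is the whole answer.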
Anti-Pattern 4: Keeping /transcribe Inside chat-files.ts
What people do: Leave the STT handler in chat-files.ts and call voicePipelineService from there, adding Nexus-specific logic to an upstream-sourced file.
Why it's wrong: chat-files.ts is a mostly-upstream Paperclip file. Each rebase introduces merge conflicts. More Nexus-specific code in the file = more conflict surface.
Do this instead: Move /transcribe and /synthesize to a new voice.ts route file (Nexus-only, never in upstream). Keep chat-files.ts as close to upstream as possible.
Anti-Pattern 5: Storing Telegram Token in Database
What people do: Create a new DB table or add a column to instance_settings to store the Telegram bot token.
Why it's wrong: Any DB schema change blocks upstream rebase (migration files conflict). The nexus-settings.json file-backed service is the established Nexus pattern for project-specific config that has no upstream equivalent.
Do this instead: Store telegramToken in nexus-settings.json via the existing nexusSettingsService. Same pattern as voiceEnabled, mode.
Integration Points
External Services
| Service | Integration Pattern | Notes |
|---|---|---|
| Telegram Bot API | grammY `bot.start()` long-polling (Node.js) | No public URL required; polling starts on server boot if token present in nexus-settings |
| whisper-cpp / openai-whisper | `execFile` cascade (same as existing `/transcribe`) | Format argument added: writes `.webm` or `.ogg` temp file based on input |
| piper TTS binary | `child_process.spawn` stdin -> stdout | Text piped to stdin; WAV or raw PCM bytes collected from stdout |
Internal Boundaries
| Boundary | Communication | Notes |
|---|---|---|
| voice route <-> voicePipelineService | Direct function call | Route is thin HTTP wrapper; all logic in service |
| telegram service <-> voicePipelineService | Direct function call | Same service used by both transports |
| telegram service <-> chatService | Direct function call | Bot calls chatService(db) directly — no HTTP round-trip |
| telegram service <-> nexusSettingsService | Direct function call | Reads voiceMode and telegramToken at start and on each message |
| web UI <-> voice route | REST: `POST /api/transcribe`, `POST /api/synthesize` | Web client uses browser-side piper WASM for TTS; `/synthesize` primarily for Telegram |
| UI VoiceModeToggle <-> nexus-settings | REST: `PATCH /api/nexus-settings` | Reads/writes `voiceMode` setting |
Build Order
Based on component dependencies, the recommended build order within this milestone:
| Step | Component(s) | Reason |
|---|---|---|
| 1 | nexus-settings schema extensions (`voiceMode`, `telegramToken`) | Everything downstream reads settings |
| 2 | `voicePipelineService` | Backs all voice. No new deps. Independently testable. |
| 3 | `voice.ts` route (`POST /transcribe`, `POST /synthesize`) | Thin wrapper. Register in `app.ts`. Move handler from chat-files. |
| 4 | `VoiceMicButton` + `WaveformDisplay` + `useSilenceDetection` | Pure UI. Depends only on `/transcribe`. |
| 5 | `VoiceModeToggle` + `useVoiceMode` | Depends on `voiceMode` in nexus-settings schema (Step 1). |
| 6 | `ChatMessage` dual output | Depends on `voiceMode` in shared `ChatMessage` type. |
| 7 | `createMessageSchema` + `ChatMessage` type (`voiceMode` flag) | Shared package change. Required by Steps 5-6. Could move earlier. |
| 8 | `telegramService` | Depends on `voicePipelineService` (2), `chatService` (existing), nexusSettings (1). |
| 9 | `telegram.ts` route + `app.ts` registration | Management endpoints. Needs `telegramService`. |
| 10 | Onboarding STT/TTS hardware detection step | Final: wires all voice detection into onboarding flow. |
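Steps 4-5 can run in parallel with Steps 7-9 if split across phases. Step 6 depends on the shared ChatMessage type from Step 7, so it should start once that change has landed.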
Sources
- Direct codebase inspection: `server/src/routes/chat-files.ts` (lines 297-386), `server/src/routes/chat.ts`, `server/src/services/nexus-settings.ts`, `server/src/app.ts`, `ui/src/components/VoiceRecordButton.tsx`, `ui/src/components/TtsButton.tsx`, `ui/src/hooks/usePiperTts.ts`, `packages/shared/src/validators/chat.ts`, `packages/shared/src/types/chat.ts`
- `.planning/STATE.md`: v1.6 architectural decisions (transport-agnostic, disposable bridge, dual output, per-message flag)
- `.planning/milestones/v1.5-phases/34-voice/34-RESEARCH.md`: existing voice implementation details, WASM TTS pattern
- grammY documentation: TypeScript-native, Bot API 9.6 (April 2026), long-polling vs webhooks
- grammY deployment types guide: long polling recommended for single-user local; Express integration pattern
- rhasspy/piper (archived): CLI `echo "text" | piper --model voice.onnx -f -`; development moved to OHF-Voice/piper1-gpl Oct 2025
- grammY supports Telegram Bot API 9.6 (released April 3, 2026); latest version confirmed