nexus/.planning/research/ARCHITECTURE.md

Architecture Research

Domain: Voice Pipeline + Minimal Telegram Bridge (v1.6), integration with existing Nexus/Paperclip monorepo
Researched: 2026-04-03
Confidence: HIGH, based on direct codebase inspection + verified current documentation


System Overview

v1.6 adds two parallel capability tracks onto the existing monorepo: a transport-agnostic voice pipeline (Whisper STT + Piper TTS) and a disposable Telegram bridge that reuses those pipeline primitives for phone access. The architecture constraint is that no voice or chat logic is Telegram-specific — Telegram is an interchangeable transport layer that calls the same server services as the web UI.

+-----------------------------------------------------------------------------------+
|                              UI Layer (React/Vite)                                |
|                                                                                   |
|  +-------------------------------------------------------------------------+     |
|  |  ChatPanel / PersonalAssistant (MODIFIED)                               |     |
|  |  +---------------------+  +--------------------+  +------------------+ |     |
|  |  | VoiceMicButton (NEW)|  | WaveformDisplay    |  | TtsButton (v1.5) | |     |
|  |  | silence detection   |  | (NEW) animated bars|  | + auto-play prop | |     |
|  |  | auto-send on silence|  +--------------------+  +------------------+ |     |
|  |  +---------------------+                                               |     |
|  |  +-------------------------------------------------------------------+ |     |
|  |  | ChatMessage (MODIFIED) — voice_mode badge, dual output toggle     | |     |
|  |  +-------------------------------------------------------------------+ |     |
|  |  +-------------------------------------------------------------------+ |     |
|  |  | VoiceModeToggle (NEW) — text only / voice input / full voice      | |     |
|  |  +-------------------------------------------------------------------+ |     |
|  +-------------------------------------------------------------------------+     |
+-----------------------------------------------------------------------------------+
                                        | HTTP + SSE
+-----------------------------------------------------------------------------------+
|                              Server Layer (Express)                               |
|                                                                                   |
|  +------------------------------------+  +------------------------------------+   |
|  |  voice.ts (NEW route)              |  |  telegram.ts (NEW route/service)   |   |
|  |  POST /transcribe  (MOVED)         |  |  grammY long-poll process          |   |
|  |  POST /synthesize  (NEW)           |  |  text + voice relay                |   |
|  +------------------------------------+  +------------------------------------+   |
|                    |                                    |                         |
|  +-----------------v--------------------------------------------v--------------+ |
|  |                    voicePipelineService (NEW — core)                          | |
|  |  transcribe(audioBuffer, format) -> string                                   | |
|  |  synthesize(text, voiceId?) -> Buffer (WAV)                                  | |
|  |  formatForVoice(text) -> { voice: string, full: string }                     | |
|  +------------------------------------------------------------------------------+ |
|                    |                                                               |
|  +-----------------v--------------------------------------------------------------+|
|  |               chatService / nexusSettingsService (EXISTING)                   ||
|  |   conversations . messages . stream SSE . memory . voiceEnabled               ||
|  +--------------------------------------------------------------------------------+|
|                    |                                                               |
|  +-----------------v--------------------------------------------------------------+|
|  |         External Processes (spawned via child_process.spawn / execFile)       ||
|  |   whisper-cpp / whisper (STT)          piper (TTS)                            ||
|  +--------------------------------------------------------------------------------+|
+-----------------------------------------------------------------------------------+
         ^
         | Telegram Bot API (HTTPS long-poll)
+--------+------------------------------------------------------------------------+
|                        Telegram (external service)                               |
|  User sends text -> bot relays to chatService -> SSE reply -> bot sends back     |
|  User sends voice -> bot downloads OGG -> voicePipelineService.transcribe()      |
|                    -> chatService -> reply -> voicePipelineService.synthesize()  |
|                    -> bot sends OGG audio reply                                  |
+----------------------------------------------------------------------------------+

Integration Points: New vs. Existing

What Stays Unchanged

| Component | Location | Status |
| --- | --- | --- |
| chatService | server/src/services/chat.ts | No changes; voice pipeline uses it as-is |
| nexusSettingsService | server/src/services/nexus-settings.ts | Extend schema only (add voiceMode, telegramToken) |
| chatFileRoutes | server/src/routes/chat-files.ts | /transcribe moves out; file upload stays |
| usePiperTts | ui/src/hooks/usePiperTts.ts | No changes; TtsButton continues using browser WASM |
| TtsButton | ui/src/components/TtsButton.tsx | Add auto-play prop only |
| SSE stream endpoint | server/src/routes/chat.ts | No changes; Telegram bridge calls services directly |
| DB schema | packages/db | No changes; voice is file/process, not a DB column |

What Changes (MODIFIED)

| Component | Location | Change |
| --- | --- | --- |
| VoiceRecordButton | ui/src/components/VoiceRecordButton.tsx | Replaced in ChatInput by the new VoiceMicButton, which carries silence detection, waveform data emission, and auto-send on silence; the file itself is kept for file upload contexts |
| ChatInput | ui/src/components/ChatInput.tsx | Wire new VoiceMicButton, add voice mode prop |
| ChatMessage | ui/src/components/ChatMessage.tsx | Show voice_mode badge, show dual output collapse/expand |
| nexusSettingsSchema | server/src/services/nexus-settings.ts | Add voiceMode enum and telegramToken optional string |
| app.ts | server/src/app.ts | Register voiceRoutes, telegramRoutes |
| createMessageSchema | packages/shared/src/validators/chat.ts | Add voiceMode: z.boolean().optional() flag on messages |
| ChatMessage type | packages/shared/src/types/chat.ts | Add voiceMode: boolean field (optional, per Pattern 3) |
| chat-files.ts | server/src/routes/chat-files.ts | Remove /transcribe handler (moved to voice.ts) |

What Is New (NEW)

| Component | Location | Purpose |
| --- | --- | --- |
| voicePipelineService | server/src/services/voice-pipeline.ts | Transport-agnostic STT/TTS core, used by web routes AND Telegram bridge |
| voice.ts (route) | server/src/routes/voice.ts | POST /api/transcribe, POST /api/synthesize: thin HTTP wrappers |
| telegram.ts (service) | server/src/services/telegram.ts | grammY bot init, long-poll loop, message relay, voice relay |
| telegram.ts (route) | server/src/routes/telegram.ts | GET /api/telegram/status, POST /api/telegram/token: management endpoints |
| VoiceMicButton | ui/src/components/VoiceMicButton.tsx | Enhanced mic button with silence detection and waveform display |
| WaveformDisplay | ui/src/components/WaveformDisplay.tsx | Animated audio waveform bars using AnalyserNode |
| VoiceModeToggle | ui/src/components/VoiceModeToggle.tsx | Three-state toggle: text only / voice input / full voice |
| useVoiceMode | ui/src/hooks/useVoiceMode.ts | Reads/writes voice mode setting via /api/nexus-settings |
| useSilenceDetection | ui/src/hooks/useSilenceDetection.ts | Web Audio API AnalyserNode watching for 1.5s silence threshold |
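
The silence rule behind useSilenceDetection can be sketched as pure logic, independent of React. This is a minimal sketch and not the hook itself: `rmsLevel`, `createSilenceTracker`, and the 0.02 RMS threshold are hypothetical names and values chosen for illustration; only the 1.5 s hold comes from the table above. It assumes the samples come from `AnalyserNode.getByteTimeDomainData`, which fills a `Uint8Array` with bytes centered on 128.

```typescript
// Root-mean-square level of time-domain bytes (0..255, where 128 means zero amplitude).
export function rmsLevel(samples: Uint8Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const sample of samples) {
    const centered = (sample - 128) / 128; // map to roughly -1..1
    sumSquares += centered * centered;
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Returns an update function that reports true once the level has stayed
// below thresholdRms for at least holdMs. Both defaults are assumptions
// except the 1500 ms hold, which mirrors the 1.5s threshold in the table.
export function createSilenceTracker(thresholdRms = 0.02, holdMs = 1500) {
  let quietSince: number | null = null;
  return function update(level: number, nowMs: number): boolean {
    if (level >= thresholdRms) {
      quietSince = null; // any sound resets the silence window
      return false;
    }
    if (quietSince === null) quietSince = nowMs;
    return nowMs - quietSince >= holdMs;
  };
}
```

In a hook, `update` would presumably be called once per animation frame with `performance.now()`, and a `true` result would trigger the auto-send.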

Component Boundaries

voicePipelineService (Core)

This is the key abstraction for v1.6. Both the web HTTP route and the Telegram bridge call this service — neither knows about the other.

| Method | Input | Output | Implementation |
| --- | --- | --- | --- |
| transcribe(buffer, format) | Buffer, "webm" or "ogg" | Promise<string> | Writes temp file, uses execFile (not exec) to spawn whisper-cpp or whisper CLI, reads stdout, cleans up |
| synthesize(text, voiceId?) | string, optional voiceId | Promise<Buffer> | Spawns piper CLI via spawn, pipes text to stdin, collects WAV stdout |
| formatForVoice(text) | string | { voice: string; full: string } | Strips code blocks and markdown for voice; returns both variants |

The transcribe method extends the existing /transcribe implementation from chat-files.ts by adding an ogg format path alongside the existing webm path. The same cascade (whisper-cpp first, openai-whisper fallback) is preserved.

Why a dedicated service vs. inline in routes: The Telegram bridge should not call the web route, since that would be a circular HTTP call within the same process. Both transports need the same logic. Extracting it to a service eliminates duplication and makes both transports testable in isolation.
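
To make the formatForVoice contract concrete, here is a minimal sketch under stated assumptions. The regular expressions are illustrative guesses at "strips code blocks and markdown", not the project's actual rules, and they cover only fenced code, inline code, images, links, headings, bullets, and asterisk emphasis.

```typescript
// Hypothetical sketch: returns a speech-friendly variant plus the untouched original.
export function formatForVoice(text: string): { voice: string; full: string } {
  const voice = text
    .replace(/`{3}[\s\S]*?`{3}/g, " ")            // drop fenced code blocks entirely
    .replace(/`([^`]*)`/g, "$1")                  // unwrap inline code
    .replace(/!\[[^\]]*\]\([^)]*\)/g, " ")        // drop images
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1")      // keep link text, drop the URL
    .replace(/^[ \t]{0,3}#{1,6}[ \t]+/gm, "")     // strip heading markers
    .replace(/^[ \t]*[-*+][ \t]+/gm, "")          // strip bullet markers
    .replace(/(\*\*|\*)(.+?)\1/g, "$2")           // unwrap bold and italic (asterisks only)
    .replace(/\s+/g, " ")                         // collapse whitespace for TTS
    .trim();
  return { voice, full: text };
}
```

Underscore emphasis is deliberately left alone in this sketch so that identifiers such as snake_case names are not mangled before they reach the TTS engine.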

telegram service

A thin relay, not a feature-rich bot. It:

  1. Holds a single grammY Bot instance, initialized when telegramToken is set in nexus-settings
  2. Routes text messages to chatService.addMessage() then collects AI response via puterProxyService.chatStream()
  3. Routes voice messages — downloads OGG file, calls voicePipelineService.transcribe(), then same text path
  4. If voiceMode === "full_voice": calls voicePipelineService.synthesize(), sends audio back via ctx.replyWithAudio()
  5. Prefixes agent name on replies: [Agent Name]: message text

No per-user conversation tracking. All Telegram messages go to a single conversation associated with the workspace, created on first use. This is the intentional "thin bridge" design; full sync is out of scope per PROJECT.md.

Voice Route vs. Chat Files Route

The existing /transcribe endpoint lives inside chatFileRoutes in chat-files.ts. For v1.6, the endpoint moves to a dedicated voice.ts route. This is a path-preserving refactor: the endpoint behavior is unchanged, but the code now lives in a Nexus-specific file rather than inside a mostly-upstream file.

Moving the handler reduces merge conflict surface on future upstream rebases of chat-files.ts.


Project Structure

server/src/
  app.ts                         # MODIFY: register voiceRoutes, telegramRoutes
  routes/
    chat-files.ts                # MODIFY: remove /transcribe handler (moved to voice.ts)
    voice.ts                     # NEW: POST /transcribe, POST /synthesize
    nexus-settings.ts            # MODIFY: expose voiceMode + telegramToken fields
    telegram.ts                  # NEW: GET /telegram/status, POST /telegram/token
  services/
    voice-pipeline.ts            # NEW: transcribe(), synthesize(), formatForVoice()
    telegram.ts                  # NEW: grammY bot lifecycle + relay logic
    nexus-settings.ts            # MODIFY: add voiceMode + telegramToken to schema

ui/src/
  components/
    VoiceMicButton.tsx           # NEW: replaces VoiceRecordButton in ChatInput
    WaveformDisplay.tsx          # NEW: animated bars from AnalyserNode data
    VoiceModeToggle.tsx          # NEW: 3-state toggle (text / voice-in / full-voice)
    VoiceRecordButton.tsx        # KEEP as-is (still used in file upload contexts)
    TtsButton.tsx                # MODIFY: add autoPlay prop
    ChatInput.tsx                # MODIFY: add VoiceModeToggle, swap in VoiceMicButton
    ChatMessage.tsx              # MODIFY: voice_mode badge + dual output expand
  hooks/
    useVoiceMode.ts              # NEW: reads/writes voiceMode setting
    useSilenceDetection.ts       # NEW: AnalyserNode silence threshold
    usePiperTts.ts               # KEEP as-is (browser-side TTS unchanged)

packages/shared/src/
  validators/chat.ts             # MODIFY: add voiceMode flag to createMessageSchema
  types/chat.ts                  # MODIFY: add voiceMode field to ChatMessage

Architectural Patterns

Pattern 1: Transport-Agnostic Voice Service

What: A server service (voicePipelineService) owns STT and TTS logic. HTTP routes and Telegram relay both call the service — neither implements STT/TTS directly.

When to use: Any time two transports (web + bot) need the same capability.

Trade-offs: Adds one indirection layer. Worth it: eliminates duplication, makes each transport testable independently.

Shape:

// server/src/services/voice-pipeline.ts
// Shape only: method signatures of the service, without implementations.
export interface VoicePipelineService {
  // Implementations use execFile (not exec): prevents shell injection, consistent with codebase pattern
  transcribe(buffer: Buffer, format: "webm" | "ogg"): Promise<string>;
  synthesize(text: string, voiceId?: string): Promise<Buffer>;
  formatForVoice(text: string): { voice: string; full: string };
}

// Factory, matching the function-style services already in the codebase.
export declare function voicePipelineService(): VoicePipelineService;

The existing /transcribe handler in chat-files.ts already uses promisify(execFile) — this pattern is the right model. The service wraps it with format selection (webm vs ogg) and the same whisper-cpp → openai-whisper cascade.
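
The temp-file handling and the backend cascade described above could look roughly like the sketch below. It is an illustration under assumptions, not the existing handler: `SttCandidate`, `Runner`, and the convention of appending the audio path as the last argument are hypothetical, and no real whisper flags are shown because the caller supplies them.

```typescript
import { execFile } from "node:child_process";
import { mkdtemp, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { promisify } from "node:util";

// One STT backend to try. The args come from the caller; none are assumed here.
export interface SttCandidate {
  command: string;
  args: string[];
}

// Injectable so the cascade can be tested without any binary installed.
export type Runner = (command: string, args: string[]) => Promise<string>;

const execFileAsync = promisify(execFile);

// execFile passes args as an array and never goes through a shell.
const defaultRunner: Runner = async (command, args) => {
  const { stdout } = await execFileAsync(command, args, { encoding: "utf8", timeout: 60_000 });
  return stdout;
};

export async function transcribe(
  buffer: Buffer,
  format: "webm" | "ogg",
  candidates: SttCandidate[],
  run: Runner = defaultRunner,
): Promise<string> {
  const dir = await mkdtemp(join(tmpdir(), "stt-"));
  const audioPath = join(dir, `input.${format}`); // extension follows the format argument
  const failures: string[] = [];
  try {
    await writeFile(audioPath, buffer);
    for (const { command, args } of candidates) {
      try {
        // Assumption of this sketch: the audio path is the last argument.
        return (await run(command, [...args, audioPath])).trim();
      } catch (err) {
        failures.push(`${command}: ${err instanceof Error ? err.message : String(err)}`);
      }
    }
    throw new Error(`No STT backend succeeded (${failures.join("; ")})`);
  } finally {
    // Temp files are removed whether transcription succeeded or not.
    await rm(dir, { recursive: true, force: true });
  }
}
```

Passing the whisper-cpp entry first and the openai-whisper entry second in `candidates` would reproduce the cascade order described above.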

Pattern 2: Thin Telegram Relay

What: The Telegram bot is a relay, not a first-class UI. It translates Telegram message events into the same chatService calls the web UI makes, then sends the response back via Telegram.

When to use: Building a disposable bridge that will be replaced by a richer implementation later.

Trade-offs: No rich UI (no inline keyboards, no threading). Acceptable: PROJECT.md explicitly calls out "thin bridge only" and "Telegram threads/topics/inline keyboards" are out of scope.

Shape:

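// server/src/services/telegram.ts
// Shape only. Db is the database handle type already passed to chatService(db);
// its import is omitted here. Internally the service holds a single grammY Bot
// instance (null until start() is called).
export interface TelegramService {
  start(token: string): void; // idempotent, long-poll
  stop(): void;
  isRunning(): boolean;
}

export declare function telegramService(db: Db): TelegramService;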

The bot calls chatService(db) and puterProxyService(db) directly — no HTTP round-trip to the same server.

Pattern 3: Voice Mode Flag on Messages

What: Each message carries an optional voiceMode: boolean flag. When true, the server formats the response for voice (dual output: voice + full), and the client auto-plays TTS and shows the full text in a collapsible block.

When to use: Differentiating voice-initiated messages from text messages within the same conversation.

Trade-offs: Adds a field to createMessageSchema and the ChatMessage type. The field is optional and defaults to false, so existing messages and the upstream schema are not broken.

Schema change:

// packages/shared/src/validators/chat.ts — additive only
export const createMessageSchema = z.object({
  role: z.enum(["user", "assistant", "system"]),
  content: z.string().min(1).max(100_000),
  agentId: z.string().uuid().optional(),
  messageType: z.string().optional(),
  voiceMode: z.boolean().optional(),  // NEW in v1.6
});

Pattern 4: Direct Service Calls in Telegram Bridge

What: The Telegram bot does not call the Express HTTP API to get AI responses. It calls chatService(db) and puterProxyService(db) as regular TypeScript function calls within the same server process.

When to use: Any time a server-side integration needs the same AI response capability as the web UI without an HTTP round-trip.

Trade-offs: Telegram handler and web handler share the same in-process service instances. If chatService has connection pooling issues, both paths are affected. This is acceptable — single-user deployment, same DB connection pool.

Why not HTTP: A fetch("http://localhost:PORT/api/...") call from within the same server requires auth token injection and port discovery, and it creates circular request chains that are hard to test and fragile in development.

Pattern 5: grammY Long-Poll for Single-User Local Deployment

What: Use grammY bot.start() (long polling) rather than webhooks. The bot polls Telegram for new messages continuously while the server is running.

When to use: Local single-user deployments where a public HTTPS endpoint is not available. No reverse proxy needed, no SSL cert, no domain.

Trade-offs: Long polling is slightly less efficient than webhooks (Telegram must respond to each poll request) but functionally equivalent for <5,000 messages/hour. Fine for personal use.

Lifecycle:

  • Start: nexusSettingsService().get() finds telegramToken set → telegramService(db).start(token)
  • Stop: server.close() → telegramService(db).stop()
  • Runtime toggle: POST /api/telegram/token updates nexus-settings and calls start/stop
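
The lifecycle above can be sketched as a small state holder. This is a hedged illustration, not the planned service: `BotLike`, `createTelegramLifecycle`, and `applyToken` are hypothetical names, and the bot is injected through a factory so the logic can run without grammY. It relies on the fact that grammY's `bot.start()` returns a promise that settles only when polling ends, so that call is not awaited.

```typescript
// Minimal surface the lifecycle needs. grammY's Bot has start() and stop()
// methods with these shapes.
export interface BotLike {
  start(): Promise<void>;
  stop(): Promise<void>;
}

export function createTelegramLifecycle(makeBot: (token: string) => BotLike) {
  let bot: BotLike | null = null;
  let activeToken: string | null = null;

  async function stop(): Promise<void> {
    if (!bot) return;
    const current = bot;
    bot = null;
    activeToken = null;
    await current.stop();
  }

  async function start(token: string): Promise<void> {
    if (bot && activeToken === token) return; // idempotent for the same token
    await stop(); // a changed token replaces the running bot
    const next = makeBot(token);
    bot = next;
    activeToken = token;
    // Not awaited: start() only settles when polling ends.
    next.start().catch((err: unknown) => {
      console.error("Telegram polling stopped with an error:", err);
      if (bot === next) {
        bot = null;
        activeToken = null;
      }
    });
  }

  // Runtime toggle: an empty or missing token stops the bot.
  async function applyToken(token: string | undefined): Promise<void> {
    if (token) await start(token);
    else await stop();
  }

  return { start, stop, applyToken, isRunning: () => bot !== null };
}
```

With grammY, the factory would presumably be `(token) => new Bot(token)` with handlers registered before it is returned; the token endpoint could then call `applyToken` after saving nexus-settings.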

Data Flow

Web Voice Input Flow

User holds mic button
    |
    v
VoiceMicButton: MediaRecorder + AnalyserNode
    |
    v (silence detected after 1.5s or stop pressed)
POST /api/transcribe {audio: webm blob}
    |
    v
voice.ts route -> voicePipelineService.transcribe(buffer, "webm")
    |
    v (whisper-cpp or openai-whisper CLI via execFile)
{ text: "transcribed text" }
    |
    v
ChatInput fills textarea -> user sends (message tagged voiceMode: true)
    |
    v
POST /conversations/:id/stream -> chatService + puterProxyService
    |
    v (SSE tokens arrive)
ChatMessage with voice_mode badge + dual output (voice text + full text collapsible)
    |
    v
TtsButton auto-plays (browser-side piper-tts-web WASM — unchanged from v1.5)

Server-Side TTS Flow (POST /synthesize)

POST /api/synthesize { text, voiceId? }
    |
    v
voice.ts route -> voicePipelineService.synthesize(text)
    |
    v (piper CLI via spawn: text -> stdin, WAV bytes <- stdout)
Response: Content-Type audio/wav, Buffer body
    |
    v
Client: new Audio(URL.createObjectURL(blob)).play()

Note: Server-side /synthesize is new in v1.6. Its primary consumer is the Telegram bridge (which cannot use browser WASM). Web chat continues using browser-side usePiperTts WASM (v1.5 unchanged). The route is available for headless/server scenarios going forward.
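
The stdin-to-stdout piping in this flow can be sketched with `child_process.spawn`. The helper below is an illustration under assumptions: `pipeThroughProcess` and the `modelPath` parameter are hypothetical, and the piper arguments follow the CLI form quoted in Sources (`piper --model voice.onnx -f -`). How a voiceId maps to a model file is not specified in this document and is left out.

```typescript
import { spawn } from "node:child_process";

// Writes `input` to the child's stdin and resolves with everything it printed to stdout.
export function pipeThroughProcess(command: string, args: string[], input: string): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    const chunks: Buffer[] = [];
    let stderr = "";

    child.stdout.on("data", (chunk: Buffer) => chunks.push(chunk));
    child.stderr.on("data", (chunk: Buffer) => {
      stderr += chunk.toString();
    });
    child.on("error", reject); // e.g. ENOENT when the binary is not installed
    child.stdin.on("error", reject); // e.g. EPIPE when the child exits early
    child.on("close", (code) => {
      if (code === 0) resolve(Buffer.concat(chunks));
      else reject(new Error(`${command} exited with code ${code}: ${stderr.trim()}`));
    });

    child.stdin.end(input);
  });
}

// Hypothetical wrapper: arguments mirror the CLI form quoted in Sources.
export function synthesize(text: string, modelPath: string): Promise<Buffer> {
  return pipeThroughProcess("piper", ["--model", modelPath, "-f", "-"], text);
}
```

Because text travels over stdin instead of the command line, user-provided text never becomes part of an argument list or a shell string.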

Telegram Text Message Flow

Telegram user sends text
    |
    v
grammY bot.on("message:text") handler
    |
    v
telegramService: resolveOrCreateConversation(db)
    |
    v
chatService(db).addMessage(conversationId, { role: "user", content: text })
    |
    v
telegramService: collect full response via puterProxyService(db).chatStream()
    |
    v (if voiceMode !== "full_voice")
ctx.reply("[AgentName]: full_response_text")

    | (if voiceMode === "full_voice")
    v
voicePipelineService.formatForVoice(response) -> { voice, full }
ctx.reply("[AgentName]: " + full)  -- text message with full details
    |
    v
voicePipelineService.synthesize(voice) -> WAV Buffer
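ctx.replyWithAudio(new InputFile(wavBuffer, "reply.wav"))  -- synthesize() returns WAV, so either name the file .wav or transcode to OGG Opus before sending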

Telegram Voice Message Flow

Telegram user sends voice note (OGG Opus format)
    |
    v
grammY bot.on("message:voice") -> ctx.getFile() -> download Buffer
    |
    v
voicePipelineService.transcribe(buffer, "ogg") -> whisper CLI -> text
    |
    v
(same path as Telegram text message above)

nexus-settings Schema Evolution

v1.5:  { mode, voiceEnabled }
v1.6:  { mode, voiceEnabled, voiceMode, telegramToken }

  voiceMode:     "text" | "voice_input" | "full_voice"  (default: "text")
  telegramToken: string | undefined                      (set by user via UI or POST /telegram/token)

voiceMode is a workspace-level setting (not per-agent). The three states map to:

  • "text": mic button transcribes to text input, TTS manual-only, Telegram text-only
  • "voice_input": mic transcribes and auto-sends, TTS manual-only, Telegram voice-in + text-out
  • "full_voice": mic auto-sends, TTS auto-plays on every response, Telegram voice-in + voice-out
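
The three states can be written down as a lookup table so that the web UI and the Telegram bridge read the same behavior flags. This is a sketch under assumptions: `VoiceBehavior`, `normalizeVoiceMode`, `behaviorFor`, and the flag names are hypothetical; the values are taken from the three bullets above, and unknown or missing values fall back to the documented default "text".

```typescript
export type VoiceMode = "text" | "voice_input" | "full_voice";

// Hypothetical flag names summarizing the per-mode behavior listed above.
export interface VoiceBehavior {
  micAutoSend: boolean;      // mic transcription is sent without a manual click
  ttsAutoPlay: boolean;      // TTS plays automatically on every response
  telegramVoiceIn: boolean;  // Telegram voice notes are transcribed
  telegramVoiceOut: boolean; // Telegram replies include synthesized audio
}

const BEHAVIOR: Record<VoiceMode, VoiceBehavior> = {
  text: { micAutoSend: false, ttsAutoPlay: false, telegramVoiceIn: false, telegramVoiceOut: false },
  voice_input: { micAutoSend: true, ttsAutoPlay: false, telegramVoiceIn: true, telegramVoiceOut: false },
  full_voice: { micAutoSend: true, ttsAutoPlay: true, telegramVoiceIn: true, telegramVoiceOut: true },
};

// Settings files written by v1.5 have no voiceMode, so anything unrecognized maps to "text".
export function normalizeVoiceMode(value: unknown): VoiceMode {
  switch (value) {
    case "voice_input":
      return "voice_input";
    case "full_voice":
      return "full_voice";
    default:
      return "text";
  }
}

export function behaviorFor(value: unknown): VoiceBehavior {
  return BEHAVIOR[normalizeVoiceMode(value)];
}
```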

Scaling Considerations

This system targets a single user on Mac Mini M4 throughout its lifetime. Scaling is not a concern. The architecture is optimized for simplicity and upstream merge compatibility.

| Concern | At 1 user (target) | Notes |
| --- | --- | --- |
| STT latency | whisper-cpp base.en on M4: ~1-3s | Acceptable; shows transcribing spinner |
| TTS latency | piper CLI on M4: ~0.3-1s for short text | <3s target met |
| Telegram poll | grammY bot.start(), 1 process | Adequate for <5,000 msgs/hour |
| Memory overhead | ~10-20MB for polling loop | Acceptable on 16GB+ M4 |
| Piper model | First server-side synthesize: cold start | Piper loads model into memory; subsequent calls fast |

Anti-Patterns

Anti-Pattern 1: Telegram-Specific Voice Logic

What people do: Implement OGG-to-text and text-to-OGG directly inside the Telegram bot handler.

Why it's wrong: Creates two separate STT/TTS code paths that diverge over time. Voice bugs must be fixed in two places. Untestable in isolation.

Do this instead: All voice processing goes through voicePipelineService. The Telegram handler calls transcribe(buf, "ogg") — the service handles format differences. The web route calls transcribe(buf, "webm") — same service, different format argument.

Anti-Pattern 2: Circular HTTP Call for Telegram AI Response

What people do: Telegram bot handler calls fetch("http://localhost:PORT/api/conversations/:id/stream") to get AI responses from within the same server process.

Why it's wrong: Requires auth token injection. Fragile (port discovery). Extra TCP round-trip. Fails in test environments where the HTTP server may not be running.

Do this instead: telegramService imports chatService(db) and puterProxyService(db) directly. Collect tokens from the async generator into a string, then send to Telegram as a single message.
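
Collecting the stream into one reply might look like the sketch below. It assumes, for illustration only, that the generator yields plain string tokens; the actual shape of what puterProxyService's chatStream() yields is not shown in this document. `collectStream` and `toTelegramMessages` are hypothetical helpers. The 4096-character split reflects the Bot API limit on message text and goes slightly beyond the single-message case described above.

```typescript
// Drains an async iterable of string tokens into one string.
export async function collectStream(stream: AsyncIterable<string>): Promise<string> {
  let text = "";
  for await (const token of stream) text += token;
  return text;
}

// Prefixes the agent name and splits the result into chunks Telegram will accept.
// Note: slicing counts UTF-16 code units, so a chunk boundary can fall inside an emoji.
export function toTelegramMessages(agentName: string, response: string, limit = 4096): string[] {
  const full = `[${agentName}]: ${response}`;
  const parts: string[] = [];
  for (let i = 0; i < full.length; i += limit) {
    parts.push(full.slice(i, i + limit));
  }
  return parts;
}
```

A handler would then await `collectStream(...)` and call `ctx.reply(part)` for each returned part, which is a single call for typical responses.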

Anti-Pattern 3: Blocking grammY on Slow CLI Processes

What people do: await synthesize() inside a bot handler with no timeout, assuming piper is always available and fast.

Why it's wrong: If the piper binary is not installed or hangs, the grammY update queue stalls. The same update gets retried indefinitely.

Do this instead: Wrap CLI calls in a Promise.race([piperCall, timeout(8_000)]). If piper times out or is not installed, fall back to text-only reply and log the failure. Bot degrades gracefully to text mode.
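
A minimal sketch of that timeout-and-fallback rule follows. `withTimeout` and `replyWithVoiceOrText` are hypothetical helpers, and the send functions are injected so the logic does not depend on grammY; the 8-second default mirrors the value above.

```typescript
// Rejects if `work` has not settled within `ms`; always clears its timer.
export function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Tries voice first; on timeout or any TTS failure, logs and degrades to text.
export async function replyWithVoiceOrText(
  synthesize: () => Promise<Buffer>,
  sendVoice: (audio: Buffer) => Promise<void>,
  sendText: () => Promise<void>,
  timeoutMs = 8_000,
): Promise<"voice" | "text"> {
  let audio: Buffer;
  try {
    audio = await withTimeout(synthesize(), timeoutMs);
  } catch (err) {
    console.error("TTS failed, falling back to text:", err);
    await sendText();
    return "text";
  }
  await sendVoice(audio);
  return "voice";
}
```

Promise.race only stops waiting; it does not end a hung piper process. Killing the child as well (for example with the `timeout` option of `child_process.spawn`, or `child.kill()`) would likely be needed to avoid leaking processes.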

Anti-Pattern 4: Keeping /transcribe Inside chat-files.ts

What people do: Leave the STT handler in chat-files.ts and call voicePipelineService from there, adding Nexus-specific logic to an upstream-sourced file.

Why it's wrong: chat-files.ts is a mostly-upstream Paperclip file. Each rebase introduces merge conflicts. More Nexus-specific code in the file = more conflict surface.

Do this instead: Move /transcribe and /synthesize to a new voice.ts route file (Nexus-only, never in upstream). Keep chat-files.ts as close to upstream as possible.

Anti-Pattern 5: Storing Telegram Token in Database

What people do: Create a new DB table or add a column to instance_settings to store the Telegram bot token.

Why it's wrong: Any DB schema change blocks upstream rebase (migration files conflict). The nexus-settings.json file-backed service is the established Nexus pattern for project-specific config that has no upstream equivalent.

Do this instead: Store telegramToken in nexus-settings.json via the existing nexusSettingsService. Same pattern as voiceEnabled, mode.


Integration Points

External Services

| Service | Integration Pattern | Notes |
| --- | --- | --- |
| Telegram Bot API | grammY bot.start() long-polling (Node.js) | No public URL required; polling starts on server boot if token present in nexus-settings |
| whisper-cpp / openai-whisper | execFile cascade (same as existing /transcribe) | Format argument added: writes .webm or .ogg temp file based on input |
| piper TTS binary | child_process.spawn stdin -> stdout | Text piped to stdin; WAV or raw PCM bytes collected from stdout |

Internal Boundaries

| Boundary | Communication | Notes |
| --- | --- | --- |
| voice route <-> voicePipelineService | Direct function call | Route is thin HTTP wrapper; all logic in service |
| telegram service <-> voicePipelineService | Direct function call | Same service used by both transports |
| telegram service <-> chatService | Direct function call | Bot calls chatService(db) directly, no HTTP round-trip |
| telegram service <-> nexusSettingsService | Direct function call | Reads voiceMode and telegramToken at start and on each message |
| web UI <-> voice route | REST: POST /api/transcribe, POST /api/synthesize | Web client uses browser-side piper WASM for TTS; /synthesize primarily for Telegram |
| UI VoiceModeToggle <-> nexus-settings | REST: PATCH /api/nexus-settings | Reads/writes voiceMode setting |

Build Order

Based on component dependencies, the recommended build order within this milestone:

| Step | Component(s) | Reason |
| --- | --- | --- |
| 1 | nexus-settings schema extensions (voiceMode, telegramToken) | Everything downstream reads settings |
| 2 | voicePipelineService | Backs all voice. No new deps. Independently testable. |
| 3 | voice.ts route (POST /transcribe, POST /synthesize) | Thin wrapper. Register in app.ts. Move handler from chat-files. |
| 4 | VoiceMicButton + WaveformDisplay + useSilenceDetection | Pure UI. Depends only on /transcribe. |
| 5 | VoiceModeToggle + useVoiceMode | Depends on voiceMode in nexus-settings schema (Step 1). |
| 6 | ChatMessage dual output | Depends on voiceMode in shared ChatMessage type. |
| 7 | createMessageSchema + ChatMessage type (voiceMode flag) | Shared package change. Required by Steps 5-6. Could move earlier. |
| 8 | telegramService | Depends on voicePipelineService (2), chatService (existing), nexusSettings (1). |
| 9 | telegram.ts route + app.ts registration | Management endpoints. Needs telegramService. |
| 10 | Onboarding STT/TTS hardware detection step | Final: wires all voice detection into onboarding flow. |

Steps 4-6 can run in parallel with Steps 7-9 if split across phases, provided the shared-type change in Step 7 lands before Step 6, which depends on it.


Sources

  • Direct codebase inspection: server/src/routes/chat-files.ts (lines 297-386), server/src/routes/chat.ts, server/src/services/nexus-settings.ts, server/src/app.ts, ui/src/components/VoiceRecordButton.tsx, ui/src/components/TtsButton.tsx, ui/src/hooks/usePiperTts.ts, packages/shared/src/validators/chat.ts, packages/shared/src/types/chat.ts
  • .planning/STATE.md — v1.6 architectural decisions (transport-agnostic, disposable bridge, dual output, per-message flag)
  • .planning/milestones/v1.5-phases/34-voice/34-RESEARCH.md — existing voice implementation details, WASM TTS pattern
  • grammY documentation — TypeScript-native, Bot API 9.6 (April 2026), long-polling vs webhooks
  • grammY deployment types guide — long polling recommended for single-user local; Express integration pattern
  • rhasspy/piper (archived) — CLI: echo "text" | piper --model voice.onnx -f -; development moved to OHF-Voice/piper1-gpl Oct 2025
  • grammY supports Telegram Bot API 9.6 (released April 3, 2026) — latest version confirmed
