Nexus Dev 0c29013931 docs: complete project research

2026-04-03 23:53:14 +00:00

Architecture Research

Domain: Voice Pipeline + Minimal Telegram Bridge (v1.6), integration with existing Nexus/Paperclip monorepo
Researched: 2026-04-03
Confidence: HIGH, based on direct codebase inspection + verified current documentation

System Overview

v1.6 adds two parallel capability tracks onto the existing monorepo: a transport-agnostic voice pipeline (Whisper STT + Piper TTS) and a disposable Telegram bridge that reuses those pipeline primitives for phone access. The architecture constraint is that no voice or chat logic is Telegram-specific — Telegram is an interchangeable transport layer that calls the same server services as the web UI.

+-----------------------------------------------------------------------------------+
|                              UI Layer (React/Vite)                                |
|                                                                                   |
|  +-------------------------------------------------------------------------+     |
|  |  ChatPanel / PersonalAssistant (MODIFIED)                               |     |
|  |  +---------------------+  +--------------------+  +------------------+ |     |
|  |  | VoiceMicButton (NEW)|  | WaveformDisplay    |  | TtsButton (v1.5) | |     |
|  |  | silence detection   |  | (NEW) animated bars|  | + auto-play prop | |     |
|  |  | auto-send on silence|  +--------------------+  +------------------+ |     |
|  |  +---------------------+                                               |     |
|  |  +-------------------------------------------------------------------+ |     |
|  |  | ChatMessage (MODIFIED) — voice_mode badge, dual output toggle     | |     |
|  |  +-------------------------------------------------------------------+ |     |
|  |  +-------------------------------------------------------------------+ |     |
|  |  | VoiceModeToggle (NEW) — text only / voice input / full voice      | |     |
|  |  +-------------------------------------------------------------------+ |     |
|  +-------------------------------------------------------------------------+     |
+-----------------------------------------------------------------------------------+
                                        | HTTP + SSE
+-----------------------------------------------------------------------------------+
|                              Server Layer (Express)                               |
|                                                                                   |
|  +------------------------------------+  +------------------------------------+   |
|  |  voice.ts (NEW route)              |  |  telegram.ts (NEW route/service)   |   |
|  |  POST /transcribe  (MOVED)         |  |  grammY long-poll process          |   |
|  |  POST /synthesize  (NEW)           |  |  text + voice relay                |   |
|  +------------------------------------+  +------------------------------------+   |
|                    |                                    |                         |
|  +-----------------v--------------------------------------------v--------------+ |
|  |                    voicePipelineService (NEW — core)                          | |
|  |  transcribe(audioBuffer, format) -> string                                   | |
|  |  synthesize(text, voiceId?) -> Buffer (WAV)                                  | |
|  |  formatForVoice(text) -> { voice: string, full: string }                     | |
|  +------------------------------------------------------------------------------+ |
|                    |                                                               |
|  +-----------------v--------------------------------------------------------------+|
|  |               chatService / nexusSettingsService (EXISTING)                   ||
|  |   conversations . messages . stream SSE . memory . voiceEnabled               ||
|  +--------------------------------------------------------------------------------+|
|                    |                                                               |
|  +-----------------v--------------------------------------------------------------+|
|  |         External Processes (spawned via child_process.spawn / execFile)       ||
|  |   whisper-cpp / whisper (STT)          piper (TTS)                            ||
|  +--------------------------------------------------------------------------------+|
+-----------------------------------------------------------------------------------+
         ^
         | Telegram Bot API (HTTPS long-poll)
+--------+------------------------------------------------------------------------+
|                        Telegram (external service)                               |
|  User sends text -> bot relays to chatService -> SSE reply -> bot sends back     |
|  User sends voice -> bot downloads OGG -> voicePipelineService.transcribe()      |
|                    -> chatService -> reply -> voicePipelineService.synthesize()  |
|                    -> bot sends OGG audio reply                                  |
+----------------------------------------------------------------------------------+

Integration Points: New vs. Existing

What Stays Unchanged

Component	Location	Status
`chatService`	`server/src/services/chat.ts`	No changes — voice pipeline uses it as-is
`nexusSettingsService`	`server/src/services/nexus-settings.ts`	Extend schema only (add `voiceMode`, `telegramToken`)
`chatFileRoutes`	`server/src/routes/chat-files.ts`	`/transcribe` moves out; file upload stays
`usePiperTts`	`ui/src/hooks/usePiperTts.ts`	No changes — TtsButton continues using browser WASM
`TtsButton`	`ui/src/components/TtsButton.tsx`	Add auto-play prop only
SSE stream endpoint	`server/src/routes/chat.ts`	No changes — Telegram bridge calls services directly
DB schema	`packages/db`	No changes — voice is file/process, not a DB column

What Changes (MODIFIED)

Component	Location	Change
`VoiceRecordButton`	`ui/src/components/VoiceRecordButton.tsx`	Add silence detection, waveform data emission, auto-send on silence
`ChatInput`	`ui/src/components/ChatInput.tsx`	Wire new VoiceMicButton, add voice mode prop
`ChatMessage`	`ui/src/components/ChatMessage.tsx`	Show voice_mode badge, show dual output collapse/expand
`nexusSettingsSchema`	`server/src/services/nexus-settings.ts`	Add `voiceMode` enum and `telegramToken` optional string
`app.ts`	`server/src/app.ts`	Register `voiceRoutes`, `telegramRoutes`
`createMessageSchema`	`packages/shared/src/validators/chat.ts`	Add `voiceMode: z.boolean().optional()` flag on messages
`ChatMessage` type	`packages/shared/src/types/chat.ts`	Add optional `voiceMode?: boolean` field
`chat-files.ts`	`server/src/routes/chat-files.ts`	Remove `/transcribe` handler (moved to voice.ts)

What Is New (NEW)

Component	Location	Purpose
`voicePipelineService`	`server/src/services/voice-pipeline.ts`	Transport-agnostic STT/TTS core — used by web routes AND Telegram bridge
`voice.ts` (route)	`server/src/routes/voice.ts`	`POST /api/transcribe`, `POST /api/synthesize` — thin HTTP wrappers
`telegram.ts` (service)	`server/src/services/telegram.ts`	grammY bot init, long-poll loop, message relay, voice relay
`telegram.ts` (route)	`server/src/routes/telegram.ts`	`GET /api/telegram/status`, `POST /api/telegram/token` management endpoints
`VoiceMicButton`	`ui/src/components/VoiceMicButton.tsx`	Enhanced mic button with silence detection and waveform display
`WaveformDisplay`	`ui/src/components/WaveformDisplay.tsx`	Animated audio waveform bars using AnalyserNode
`VoiceModeToggle`	`ui/src/components/VoiceModeToggle.tsx`	Three-state toggle: text only / voice input / full voice
`useVoiceMode`	`ui/src/hooks/useVoiceMode.ts`	Reads/writes voice mode setting via `/api/nexus-settings`
`useSilenceDetection`	`ui/src/hooks/useSilenceDetection.ts`	Web Audio API AnalyserNode watching for 1.5s silence threshold
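
The silence threshold that `useSilenceDetection` watches for can be sketched as a small state machine, kept separate from the Web Audio wiring so it runs without a browser. This is a minimal sketch under stated assumptions: the names `rms` and `createSilenceTracker` and the RMS threshold of 0.01 are illustrative and do not come from the codebase; only the 1.5 s window comes from the table above.

```typescript
// Hypothetical helpers: only the 1.5 s silence window comes from the design.

/** Root-mean-square level of time-domain samples in the range [-1, 1]. */
export function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

export interface SilenceTracker {
  /** Feed one analyser frame; returns true once silence has lasted long enough. */
  update(samples: Float32Array, nowMs: number): boolean;
}

export function createSilenceTracker(
  silenceMs = 1500, // 1.5 s window from the design
  rmsThreshold = 0.01, // assumed level; would need tuning per microphone
): SilenceTracker {
  let silentSince: number | null = null;
  return {
    update(samples, nowMs) {
      if (rms(samples) >= rmsThreshold) {
        silentSince = null; // any speech resets the window
        return false;
      }
      if (silentSince === null) silentSince = nowMs;
      return nowMs - silentSince >= silenceMs;
    },
  };
}
```

In the hook, each frame would presumably come from `AnalyserNode.getFloatTimeDomainData()` on an animation-frame loop, and a `true` result would trigger the auto-send.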

Component Boundaries

voicePipelineService (Core)

This is the key abstraction for v1.6. Both the web HTTP route and the Telegram bridge call this service — neither knows about the other.

Method	Input	Output	Implementation
`transcribe(buffer, format)`	`Buffer`, `"webm" or "ogg"`	`Promise<string>`	Writes temp file, uses `execFile` (not `exec`) to spawn `whisper-cpp` or `whisper` CLI, reads stdout, cleans up
`synthesize(text, voiceId?)`	`string`, optional voiceId	`Promise<Buffer>`	Spawns `piper` CLI via `spawn`, pipes text to stdin, collects WAV stdout
`formatForVoice(text)`	`string`	`{ voice: string; full: string }`	Strips code blocks and markdown for voice; returns both variants
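
The `formatForVoice` row can be illustrated with a small pure function. This is a minimal sketch, assuming regex-based stripping is acceptable; the exact stripping rules are not specified in this research, so every pattern below is an assumption.

```typescript
// Illustrative only: the stripping rules are assumptions, not the project's implementation.
const FENCE = "`".repeat(3); // built at runtime so this sample contains no literal fence
const FENCED_BLOCK = new RegExp(FENCE + "[\\s\\S]*?" + FENCE, "g");

export function formatForVoice(text: string): { voice: string; full: string } {
  const voice = text
    .replace(FENCED_BLOCK, " ") // drop fenced code blocks entirely
    .replace(/`([^`]*)`/g, "$1") // unwrap inline code
    .replace(/!?\[([^\]]*)\]\([^)]*\)/g, "$1") // links and images: keep the label
    .replace(/^[ \t]{0,3}(#{1,6}|>|[-*+])\s+/gm, "") // heading, quote, bullet markers
    .replace(/(\*\*|\*|~~)/g, "") // emphasis markers
    .replace(/\s+/g, " ") // collapse whitespace for the TTS engine
    .trim();
  // `full` is the untouched original, shown to the user as text.
  return { voice, full: text };
}
```

Underscore emphasis is deliberately left alone here so identifiers such as `voice_mode` are not mangled when spoken.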

The transcribe method extends the existing /transcribe implementation from chat-files.ts by adding an ogg format path alongside the existing webm path. The same cascade (whisper-cpp first, openai-whisper fallback) is preserved.

Why a dedicated service vs. inline in routes: The Telegram bridge should not call the web route, because that would be a circular HTTP call within the same process (see Pattern 4). Both transports need the same logic. Extracting it to a service eliminates duplication and makes both implementations testable in isolation.

telegram service

A thin relay, not a feature-rich bot. It:

Holds a single grammY Bot instance, initialized when telegramToken is set in nexus-settings
Routes text messages to chatService.addMessage() then collects AI response via puterProxyService.chatStream()
Routes voice messages — downloads OGG file, calls voicePipelineService.transcribe(), then same text path
If voiceMode === "full_voice": calls voicePipelineService.synthesize(), sends audio back via ctx.replyWithAudio()
Prefixes agent name on replies: [Agent Name]: message text

No per-user conversation tracking. All Telegram messages go to a single conversation (or create one on first use) associated with the workspace. This is the intentional "thin bridge" design — full sync is out of scope per PROJECT.md.

Voice Route vs. Chat Files Route

The existing /transcribe endpoint lives inside chatFileRoutes in chat-files.ts. For v1.6, the endpoint moves to a dedicated voice.ts route. This is a path-preserving refactor: the endpoint behavior is unchanged, but the code now lives in a Nexus-specific file rather than inside a mostly-upstream file.

Moving the handler reduces merge conflict surface on future upstream rebases of chat-files.ts.

Recommended Project Structure

server/src/
  app.ts                         # MODIFY: register voiceRoutes, telegramRoutes
  routes/
    chat-files.ts                # MODIFY: remove /transcribe handler (moved to voice.ts)
    voice.ts                     # NEW: POST /transcribe, POST /synthesize
    nexus-settings.ts            # MODIFY: expose voiceMode + telegramToken fields
    telegram.ts                  # NEW: GET /telegram/status, POST /telegram/token
  services/
    voice-pipeline.ts            # NEW: transcribe(), synthesize(), formatForVoice()
    telegram.ts                  # NEW: grammY bot lifecycle + relay logic
    nexus-settings.ts            # MODIFY: add voiceMode + telegramToken to schema

ui/src/
  components/
    VoiceMicButton.tsx           # NEW: replaces VoiceRecordButton in ChatInput
    WaveformDisplay.tsx          # NEW: animated bars from AnalyserNode data
    VoiceModeToggle.tsx          # NEW: 3-state toggle (text / voice-in / full-voice)
    VoiceRecordButton.tsx        # KEEP as-is (still used in file upload contexts)
    TtsButton.tsx                # MODIFY: add autoPlay prop
    ChatInput.tsx                # MODIFY: add VoiceModeToggle, swap in VoiceMicButton
    ChatMessage.tsx              # MODIFY: voice_mode badge + dual output expand
  hooks/
    useVoiceMode.ts              # NEW: reads/writes voiceMode setting
    useSilenceDetection.ts       # NEW: AnalyserNode silence threshold
    usePiperTts.ts               # KEEP as-is (browser-side TTS unchanged)

packages/shared/src/
  validators/chat.ts             # MODIFY: add voiceMode flag to createMessageSchema
  types/chat.ts                  # MODIFY: add voiceMode field to ChatMessage

Architectural Patterns

Pattern 1: Transport-Agnostic Voice Service

What: A server service (voicePipelineService) owns STT and TTS logic. HTTP routes and Telegram relay both call the service — neither implements STT/TTS directly.

When to use: Any time two transports (web + bot) need the same capability.

Trade-offs: Adds one indirection layer. Worth it: eliminates duplication, makes each transport testable independently.

Shape:

// server/src/services/voice-pipeline.ts
export interface VoicePipelineService {
  // Implementations use execFile (not exec): no shell, so no shell injection; consistent with codebase pattern
  transcribe(buffer: Buffer, format: "webm" | "ogg"): Promise<string>;
  synthesize(text: string, voiceId?: string): Promise<Buffer>;
  formatForVoice(text: string): { voice: string; full: string };
}

// Factory; bodies omitted in this shape sketch
export declare function voicePipelineService(): VoicePipelineService;

The existing /transcribe handler in chat-files.ts already uses promisify(execFile) — this pattern is the right model. The service wraps it with format selection (webm vs ogg) and the same whisper-cpp → openai-whisper cascade.
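
The cascade described above can be sketched with injected runners, so the fallback order is testable without either CLI installed. This is a sketch under stated assumptions: `SttRunner`, `transcribeWithCascade`, and `tempExtension` are hypothetical names, and the real runners would wrap `execFile` calls whose flags are intentionally not shown here.

```typescript
// Hypothetical names. Each runner would wrap one execFile call in the real service.
export type SttRunner = (audioPath: string) => Promise<string>;

/** Try each engine in order; fall through to the next one only when a runner throws. */
export async function transcribeWithCascade(
  audioPath: string,
  runners: ReadonlyArray<{ name: string; run: SttRunner }>,
): Promise<{ engine: string; text: string }> {
  const failures: string[] = [];
  for (const { name, run } of runners) {
    try {
      const text = (await run(audioPath)).trim();
      return { engine: name, text };
    } catch (err) {
      // e.g. ENOENT when the binary is missing: record it and try the next engine
      failures.push(name + ": " + (err instanceof Error ? err.message : String(err)));
    }
  }
  throw new Error("all STT engines failed (" + failures.join("; ") + ")");
}

/** Temp-file extension chosen from the format argument. */
export function tempExtension(format: "webm" | "ogg"): ".webm" | ".ogg" {
  return format === "ogg" ? ".ogg" : ".webm";
}
```

Passing the runners in as data keeps the whisper-cpp first, openai-whisper second order visible at the call site rather than buried in nested try/catch blocks.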

Pattern 2: Thin Telegram Relay

What: The Telegram bot is a relay, not a first-class UI. It translates Telegram message events into the same chatService calls the web UI makes, then sends the response back via Telegram.

When to use: Building a disposable bridge that will be replaced by a richer implementation later.

Trade-offs: No rich UI (no inline keyboards, no threading). Acceptable: PROJECT.md explicitly calls out "thin bridge only" and "Telegram threads/topics/inline keyboards" are out of scope.

Shape:

// server/src/services/telegram.ts
import { Bot } from "grammy";

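// Shape only; bodies omitted. `Db` is the project's existing database handle type.
export interface TelegramService {
  start(token: string): void; // idempotent; begins long polling
  stop(): void;
  isRunning(): boolean;
}

// Internally holds a single `Bot | null` instance (see the import above).
export declare function telegramService(db: Db): TelegramService;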

The bot calls chatService(db) and puterProxyService(db) directly — no HTTP round-trip to the same server.

Pattern 3: Voice Mode Flag on Messages

What: Each message carries an optional voiceMode: boolean flag. When true, the server formats the response for voice (dual output: voice + full), and the client auto-plays TTS and shows the full text in a collapsible block.

When to use: Differentiating voice-initiated messages from text messages within the same conversation.

Trade-offs: Adds a field to createMessageSchema and the ChatMessage type. The field is optional and defaults to false, so existing messages and the upstream schema are not broken.

Schema change:

// packages/shared/src/validators/chat.ts — additive only
export const createMessageSchema = z.object({
  role: z.enum(["user", "assistant", "system"]),
  content: z.string().min(1).max(100_000),
  agentId: z.string().uuid().optional(),
  messageType: z.string().optional(),
  voiceMode: z.boolean().optional(),  // NEW in v1.6
});

Pattern 4: Direct Service Calls in Telegram Bridge

What: The Telegram bot does not call the Express HTTP API to get AI responses. It calls chatService(db) and puterProxyService(db) as regular TypeScript function calls within the same server process.

When to use: Any time a server-side integration needs the same AI response capability as the web UI without an HTTP round-trip.

Trade-offs: Telegram handler and web handler share the same in-process service instances. If chatService has connection pooling issues, both paths are affected. This is acceptable — single-user deployment, same DB connection pool.

Why not HTTP: A fetch("http://localhost:PORT/api/...") call from within the same server requires auth token injection, port discovery, and creates circular request chains that are hard to test and fragile in development.

Pattern 5: grammY Long-Poll for Single-User Local Deployment

What: Use grammY bot.start() (long polling) rather than webhooks. The bot polls Telegram for new messages continuously while the server is running.

When to use: Local single-user deployments where a public HTTPS endpoint is not available. No reverse proxy needed, no SSL cert, no domain.

Trade-offs: Long polling is slightly less efficient than webhooks (Telegram must respond to each poll request) but functionally equivalent for <5,000 messages/hour. Fine for personal use.

Lifecycle:

Start: nexusSettingsService().get() finds telegramToken set → telegramService(db).start(token)
Stop: server.close() → telegramService(db).stop()
Runtime toggle: POST /api/telegram/token updates nexus-settings and calls start/stop
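
The lifecycle above can be sketched against a minimal bot interface. This is an assumption-laden sketch: `PollingBot`, `createBridgeLifecycle`, and `applyToken` are hypothetical names. grammY's `Bot` exposes `start()` and `stop()`, but the code below depends only on the local interface so it runs without the library.

```typescript
// Hypothetical interface: the only surface this lifecycle sketch needs from a bot.
export interface PollingBot {
  start(): void | Promise<void>;
  stop(): void | Promise<void>;
}

export function createBridgeLifecycle(
  makeBot: (token: string) => PollingBot,
  onError: (err: unknown) => void = (err) => console.error("telegram bridge:", err),
) {
  let bot: PollingBot | null = null;
  let activeToken: string | null = null;

  function stop(): void {
    const current = bot;
    bot = null;
    activeToken = null;
    if (current) Promise.resolve(current.stop()).catch(onError);
  }

  function start(token: string): void {
    if (bot && activeToken === token) return; // idempotent for the same token
    stop(); // token changed: replace the running bot
    const next = makeBot(token);
    bot = next;
    activeToken = token;
    // Long polling runs until stop(); it is not awaited, only observed for errors.
    Promise.resolve(next.start()).catch(onError);
  }

  /** Runtime toggle: a token starts the bridge, an empty or missing token stops it. */
  function applyToken(token: string | undefined): void {
    if (token) start(token);
    else stop();
  }

  return { start, stop, applyToken, isRunning: () => bot !== null };
}
```

With a closure-based factory like this, the server would need to hold one lifecycle instance and reuse it for boot, shutdown, and the token endpoint, since a second factory call would not see the first instance's bot.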

Data Flow

Web Voice Input Flow

User holds mic button
    |
    v
VoiceMicButton: MediaRecorder + AnalyserNode
    |
    v (silence detected after 1.5s or stop pressed)
POST /api/transcribe {audio: webm blob}
    |
    v
voice.ts route -> voicePipelineService.transcribe(buffer, "webm")
    |
    v (whisper-cpp or openai-whisper CLI via execFile)
{ text: "transcribed text" }
    |
    v
ChatInput fills textarea -> user sends (message tagged voiceMode: true)
    |
    v
POST /api/conversations/:id/stream -> chatService + puterProxyService
    |
    v (SSE tokens arrive)
ChatMessage with voice_mode badge + dual output (voice text + full text collapsible)
    |
    v
TtsButton auto-plays (browser-side piper-tts-web WASM — unchanged from v1.5)

Server-Side TTS Flow (POST /synthesize)

POST /api/synthesize { text, voiceId? }
    |
    v
voice.ts route -> voicePipelineService.synthesize(text)
    |
    v (piper CLI via spawn: text -> stdin, WAV bytes <- stdout)
Response: Content-Type audio/wav, Buffer body
    |
    v
Client: new Audio(URL.createObjectURL(blob)).play()

Note: Server-side /synthesize is new in v1.6. Its primary consumer is the Telegram bridge (which cannot use browser WASM). Web chat continues using browser-side usePiperTts WASM (v1.5 unchanged). The route is available for headless/server scenarios going forward.
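
The stdin-to-stdout step in the flow above can be sketched as a generic helper around `child_process.spawn`. This is a minimal sketch, not the service's implementation: `runWithStdin` is a hypothetical name, and the command and its arguments are left to the caller.

```typescript
import { spawn } from "node:child_process";

/**
 * Write `input` to the child's stdin and resolve with everything it wrote to stdout.
 * Rejects on spawn failure (e.g. binary not installed) or a non-zero exit code.
 */
export function runWithStdin(command: string, args: string[], input: string): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args); // no shell involved; stdio defaults to pipes
    const out: Buffer[] = [];
    const err: Buffer[] = [];

    child.stdout.on("data", (chunk: Buffer) => out.push(chunk));
    child.stderr.on("data", (chunk: Buffer) => err.push(chunk));
    child.on("error", reject); // e.g. ENOENT
    child.on("close", (code) => {
      if (code === 0) {
        resolve(Buffer.concat(out));
      } else {
        const detail = Buffer.concat(err).toString("utf8").trim();
        reject(new Error(command + " exited with code " + code + ": " + detail));
      }
    });

    // A child that exits early can make the stdin write fail; the close/error
    // handlers above already report that case, so the stream error is swallowed.
    child.stdin.on("error", () => {});
    child.stdin.end(input);
  });
}
```

For piper, a call might look like `runWithStdin("piper", ["--model", modelPath, "-f", "-"], voiceText)`, following the CLI form cited in Sources; `modelPath` is a placeholder for a deployment-specific voice model.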

Telegram Text Message Flow

Telegram user sends text
    |
    v
grammY bot.on("message:text") handler
    |
    v
telegramService: resolveOrCreateConversation(db)
    |
    v
chatService(db).addMessage(conversationId, { role: "user", content: text })
    |
    v
telegramService: collect full response via puterProxyService(db).chatStream()
    |
    v (if voiceMode !== "full_voice")
ctx.reply("[AgentName]: full_response_text")

    | (if voiceMode === "full_voice")
    v
voicePipelineService.formatForVoice(response) -> { voice, full }
ctx.reply("[AgentName]: " + full)  -- text message with full details
    |
    v
voicePipelineService.synthesize(voice) -> WAV Buffer
ctx.replyWithAudio(new InputFile(wavBuffer, "reply.wav"))  -- file name matches the WAV payload; transcode first if an OGG reply is required
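
The branch on `voiceMode` in the flow above can be sketched as a pure planning function that returns what to send, leaving the grammY calls to the caller. The names `TelegramAction` and `planTelegramReply` are hypothetical, and the formatter is injected so the sketch stays self-contained.

```typescript
export type VoiceMode = "text" | "voice_input" | "full_voice";

// Hypothetical action type: what the handler should send, in order.
export type TelegramAction =
  | { kind: "text"; body: string }
  | { kind: "audio"; spokenText: string }; // to be synthesized, then sent as audio

export function planTelegramReply(
  response: string,
  voiceMode: VoiceMode,
  agentName: string,
  formatForVoice: (text: string) => { voice: string; full: string },
): TelegramAction[] {
  if (voiceMode !== "full_voice") {
    // Text-only reply with the agent-name prefix
    return [{ kind: "text", body: "[" + agentName + "]: " + response }];
  }
  const { voice, full } = formatForVoice(response);
  return [
    { kind: "text", body: "[" + agentName + "]: " + full }, // full details as text
    { kind: "audio", spokenText: voice }, // voice-friendly variant for TTS
  ];
}
```

Keeping the decision separate from the send calls means the branching can be unit-tested without a bot token or a piper binary.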

Telegram Voice Message Flow

Telegram user sends voice note (OGG Opus format)
    |
    v
grammY bot.on("message:voice") -> ctx.getFile() -> download Buffer
    |
    v
voicePipelineService.transcribe(buffer, "ogg") -> whisper CLI -> text
    |
    v
(same path as Telegram text message above)

nexus-settings Schema Evolution

v1.5:  { mode, voiceEnabled }
v1.6:  { mode, voiceEnabled, voiceMode, telegramToken }

  voiceMode:     "text" | "voice_input" | "full_voice"  (default: "text")
  telegramToken: string | undefined                      (set by user via UI or POST /telegram/token)

voiceMode is a workspace-level setting (not per-agent). The three states map to:

"text": mic button transcribes to text input, TTS manual-only, Telegram text-only
"voice_input": mic transcribes and auto-sends, TTS manual-only, Telegram voice-in + text-out
"full_voice": mic auto-sends, TTS auto-plays on every response, Telegram voice-in + voice-out
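
The three states above can be written as a lookup table so that UI and bridge code read the same flags. This is a sketch under stated assumptions: `VoiceBehavior`, `VOICE_BEHAVIOR`, and `parseVoiceMode` are hypothetical names, and the flag values are a direct reading of the list above.

```typescript
export type VoiceMode = "text" | "voice_input" | "full_voice";

// Hypothetical flag set derived from the three-state description.
export interface VoiceBehavior {
  micAutoSend: boolean; // send the transcript without waiting for the user
  ttsAutoPlay: boolean; // play TTS on every response
  telegramVoiceIn: boolean; // accept Telegram voice notes
  telegramVoiceOut: boolean; // reply on Telegram with synthesized audio
}

export const VOICE_BEHAVIOR: Record<VoiceMode, VoiceBehavior> = {
  text: { micAutoSend: false, ttsAutoPlay: false, telegramVoiceIn: false, telegramVoiceOut: false },
  voice_input: { micAutoSend: true, ttsAutoPlay: false, telegramVoiceIn: true, telegramVoiceOut: false },
  full_voice: { micAutoSend: true, ttsAutoPlay: true, telegramVoiceIn: true, telegramVoiceOut: true },
};

/** Settings files written by v1.5 have no voiceMode, so unknown values fall back to "text". */
export function parseVoiceMode(value: unknown): VoiceMode {
  return value === "voice_input" || value === "full_voice" ? value : "text";
}
```

The fallback in `parseVoiceMode` mirrors the documented default of `"text"`, which keeps a v1.5 `nexus-settings.json` readable without a migration step.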

Scaling Considerations

This system targets a single user on Mac Mini M4 throughout its lifetime. Scaling is not a concern. The architecture is optimized for simplicity and upstream merge compatibility.

Concern	At 1 user (target)	Notes
STT latency	whisper-cpp base.en on M4: ~1-3s	Acceptable; shows transcribing spinner
TTS latency	piper CLI on M4: ~0.3-1s for short text	<3s target met
Telegram poll	grammY `bot.start()`, 1 process	Adequate for <5,000 msgs/hour
Memory overhead	~10-20MB for polling loop	Acceptable on 16GB+ M4
Piper model	Cold start on server-side synthesize	Each spawned `piper` process loads the model itself; repeat calls are faster mainly through the OS file cache

Anti-Patterns

Anti-Pattern 1: Telegram-Specific Voice Logic

What people do: Implement OGG-to-text and text-to-OGG directly inside the Telegram bot handler.

Why it's wrong: Creates two separate STT/TTS code paths that diverge over time. Voice bugs must be fixed in two places. Untestable in isolation.

Do this instead: All voice processing goes through voicePipelineService. The Telegram handler calls transcribe(buf, "ogg") — the service handles format differences. The web route calls transcribe(buf, "webm") — same service, different format argument.

Anti-Pattern 2: Circular HTTP Call for Telegram AI Response

What people do: Telegram bot handler calls fetch("http://localhost:PORT/api/conversations/:id/stream") to get AI responses from within the same server process.

Why it's wrong: Requires auth token injection. Fragile (port discovery). Extra TCP round-trip. Fails in test environments where the HTTP server may not be running.

Do this instead: telegramService imports chatService(db) and puterProxyService(db) directly. Collect tokens from the async generator into a string, then send to Telegram as a single message.
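
Collecting the stream into a single message can be sketched as follows. The sketch assumes the generator yields plain string chunks; the actual chunk type of `puterProxyService(db).chatStream()` is not shown in this research, and `collectReply` is a hypothetical name.

```typescript
/** Drain an async token stream into one string, ready to send as a single message. */
export async function collectReply(tokens: AsyncIterable<string>): Promise<string> {
  let text = "";
  for await (const token of tokens) {
    text += token;
  }
  return text.trim();
}
```

Because any async iterable works as input, a test can pass a small async generator in place of the real stream.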

Anti-Pattern 3: Blocking grammY on Slow CLI Processes

What people do: await synthesize() inside a bot handler with no timeout, assuming piper is always available and fast.

Why it's wrong: If the piper binary is not installed or hangs, the grammY update queue stalls. The same update gets retried indefinitely.

Do this instead: Wrap CLI calls in a Promise.race([piperCall, timeout(8_000)]). If piper times out or is not installed, fall back to text-only reply and log the failure. Bot degrades gracefully to text mode.
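
The timeout wrapper can be sketched as below. `withTimeout` and `synthesizeOrNull` are hypothetical names. One caveat the sketch makes explicit: `Promise.race` only stops waiting, so the optional `onTimeout` callback is where a hung child process would be killed.

```typescript
export class TimeoutError extends Error {}

/** Reject with TimeoutError if `work` does not settle within `ms` milliseconds. */
export function withTimeout<T>(work: Promise<T>, ms: number, onTimeout?: () => void): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      onTimeout?.(); // e.g. kill the child process; the race itself does not stop it
      reject(new TimeoutError("timed out after " + ms + " ms"));
    }, ms);
  });
  // Clear the timer either way so a fast result leaves nothing pending.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

/** Returns audio bytes, or null when TTS fails or times out, so the caller can reply with text only. */
export async function synthesizeOrNull(
  synthesize: () => Promise<Uint8Array>,
  ms = 8_000, // timeout suggested in the text above
): Promise<Uint8Array | null> {
  try {
    return await withTimeout(synthesize(), ms);
  } catch (err) {
    console.error("TTS unavailable, falling back to text:", err);
    return null;
  }
}
```

A handler would then send audio only when the result is non-null, which gives the text-mode fallback described above.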

Anti-Pattern 4: Keeping /transcribe Inside chat-files.ts

What people do: Leave the STT handler in chat-files.ts and call voicePipelineService from there, adding Nexus-specific logic to an upstream-sourced file.

Why it's wrong: chat-files.ts is a mostly-upstream Paperclip file. Each rebase introduces merge conflicts. More Nexus-specific code in the file = more conflict surface.

Do this instead: Move /transcribe and /synthesize to a new voice.ts route file (Nexus-only, never in upstream). Keep chat-files.ts as close to upstream as possible.

Anti-Pattern 5: Storing Telegram Token in Database

What people do: Create a new DB table or add a column to instance_settings to store the Telegram bot token.

Why it's wrong: Any DB schema change blocks upstream rebase (migration files conflict). The nexus-settings.json file-backed service is the established Nexus pattern for project-specific config that has no upstream equivalent.

Do this instead: Store telegramToken in nexus-settings.json via the existing nexusSettingsService. Same pattern as voiceEnabled, mode.

Integration Points

External Services

Service	Integration Pattern	Notes
Telegram Bot API	grammY `bot.start()` long-polling (Node.js)	No public URL required; polling starts on server boot if token present in nexus-settings
whisper-cpp / openai-whisper	`execFile` cascade (same as existing `/transcribe`)	Format argument added: writes `.webm` or `.ogg` temp file based on input
piper TTS binary	`child_process.spawn` stdin -> stdout	Text piped to stdin; WAV or raw PCM bytes collected from stdout

Internal Boundaries

Boundary	Communication	Notes
voice route <-> voicePipelineService	Direct function call	Route is thin HTTP wrapper; all logic in service
telegram service <-> voicePipelineService	Direct function call	Same service used by both transports
telegram service <-> chatService	Direct function call	Bot calls `chatService(db)` directly — no HTTP round-trip
telegram service <-> nexusSettingsService	Direct function call	Reads `voiceMode` and `telegramToken` at start and on each message
web UI <-> voice route	REST: `POST /api/transcribe`, `POST /api/synthesize`	Web client uses browser-side piper WASM for TTS; `/synthesize` primarily for Telegram
UI VoiceModeToggle <-> nexus-settings	REST: `PATCH /api/nexus-settings`	Reads/writes `voiceMode` setting

Build Order

Based on component dependencies, the recommended build order within this milestone:

Step	Component(s)	Reason
1	`nexus-settings` schema extensions (`voiceMode`, `telegramToken`)	Everything downstream reads settings
2	`voicePipelineService`	Backs all voice. No new deps. Independently testable.
3	`voice.ts` route (`POST /transcribe`, `POST /synthesize`)	Thin wrapper. Register in `app.ts`. Move handler from chat-files.
4	`VoiceMicButton` + `WaveformDisplay` + `useSilenceDetection`	Pure UI. Depends only on `/transcribe`.
5	`VoiceModeToggle` + `useVoiceMode`	Depends on `voiceMode` in nexus-settings schema (Step 1).
6	`ChatMessage` dual output	Depends on `voiceMode` in shared `ChatMessage` type.
7	`createMessageSchema` + `ChatMessage` type (`voiceMode` flag)	Shared package change. Required by Steps 5-6. Could move earlier.
8	`telegramService`	Depends on voicePipelineService (2), chatService (existing), nexusSettings (1).
9	`telegram.ts` route + app.ts registration	Management endpoints. Needs telegramService.
10	Onboarding STT/TTS hardware detection step	Final: wires all voice detection into onboarding flow.

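Steps 4-5 can run in parallel with Steps 7-9 if split across phases. Step 6 needs the shared `ChatMessage` type change from Step 7, so either move Step 7 earlier or schedule Step 6 after it.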

Sources

Direct codebase inspection: server/src/routes/chat-files.ts (lines 297-386), server/src/routes/chat.ts, server/src/services/nexus-settings.ts, server/src/app.ts, ui/src/components/VoiceRecordButton.tsx, ui/src/components/TtsButton.tsx, ui/src/hooks/usePiperTts.ts, packages/shared/src/validators/chat.ts, packages/shared/src/types/chat.ts
.planning/STATE.md — v1.6 architectural decisions (transport-agnostic, disposable bridge, dual output, per-message flag)
.planning/milestones/v1.5-phases/34-voice/34-RESEARCH.md — existing voice implementation details, WASM TTS pattern
grammY documentation — TypeScript-native, Bot API 9.6 (April 2026), long-polling vs webhooks
grammY deployment types guide — long polling recommended for single-user local; Express integration pattern
rhasspy/piper (archived) — CLI: echo "text" | piper --model voice.onnx -f -; development moved to OHF-Voice/piper1-gpl Oct 2025
grammY supports Telegram Bot API 9.6 (released April 3, 2026) — latest version confirmed

Architecture research for: Voice Pipeline + Minimal Telegram Bridge (v1.6) Researched: 2026-04-03
