# Architecture Research

**Domain:** Voice Pipeline + Minimal Telegram Bridge (v1.6), integrated with the existing Nexus/Paperclip monorepo
**Researched:** 2026-04-03
**Confidence:** HIGH, based on direct codebase inspection + verified current documentation

---

## System Overview

v1.6 adds two parallel capability tracks onto the existing monorepo: a transport-agnostic voice pipeline (Whisper STT + Piper TTS) and a disposable Telegram bridge that reuses those pipeline primitives for phone access. The architecture constraint is that no voice or chat logic is Telegram-specific: Telegram is an interchangeable transport layer that calls the same server services as the web UI.

```
+------------------------------------------------------------------------------+
| UI Layer (React/Vite)                                                        |
|                                                                              |
|  +------------------------------------------------------------------------+  |
|  | ChatPanel / PersonalAssistant (MODIFIED)                               |  |
|  |  +---------------------+  +--------------------+  +------------------+ |  |
|  |  | VoiceMicButton (NEW)|  | WaveformDisplay    |  | TtsButton (v1.5) | |  |
|  |  | silence detection   |  | (NEW) animated bars|  | + auto-play prop | |  |
|  |  | auto-send on silence|  +--------------------+  +------------------+ |  |
|  |  +---------------------+                                               |  |
|  |  +------------------------------------------------------------------+  |  |
|  |  | ChatMessage (MODIFIED): voice_mode badge, dual output toggle     |  |  |
|  |  +------------------------------------------------------------------+  |  |
|  |  +------------------------------------------------------------------+  |  |
|  |  | VoiceModeToggle (NEW): text only / voice input / full voice      |  |  |
|  |  +------------------------------------------------------------------+  |  |
|  +------------------------------------------------------------------------+  |
+------------------------------------------------------------------------------+
                                       | HTTP + SSE
+------------------------------------------------------------------------------+
| Server Layer (Express)                                                       |
|                                                                              |
|  +----------------------------------+  +----------------------------------+  |
|  | voice.ts (NEW route)             |  | telegram.ts (NEW route/service)  |  |
|  | POST /transcribe (MOVED)         |  | grammY long-poll process         |  |
|  | POST /synthesize (NEW)           |  | text + voice relay               |  |
|  +----------------------------------+  +----------------------------------+  |
|                   |                                     |                    |
|  +----------------v-------------------------------------v-----------------+  |
|  | voicePipelineService (NEW, core)                                       |  |
|  |   transcribe(audioBuffer, format) -> string                            |  |
|  |   synthesize(text, voiceId?) -> Buffer (WAV)                           |  |
|  |   formatForVoice(text) -> { voice: string, full: string }              |  |
|  +------------------------------------------------------------------------+  |
|                   |                                                          |
|  +----------------v-------------------------------------------------------+  |
|  | chatService / nexusSettingsService (EXISTING)                          |  |
|  | conversations . messages . stream SSE . memory . voiceEnabled          |  |
|  +------------------------------------------------------------------------+  |
|                   |                                                          |
|  +----------------v-------------------------------------------------------+  |
|  | External Processes (spawned via child_process.spawn / execFile)        |  |
|  | whisper-cpp / whisper (STT)        piper (TTS)                         |  |
|  +------------------------------------------------------------------------+  |
+------------------------------------------------------------------------------+
         ^
         | Telegram Bot API (HTTPS long-poll)
+--------+---------------------------------------------------------------------+
| Telegram (external service)                                                  |
| User sends text -> bot relays to chatService -> SSE reply -> bot sends back  |
| User sends voice -> bot downloads OGG -> voicePipelineService.transcribe()   |
|   -> chatService -> reply -> voicePipelineService.synthesize()               |
|   -> bot sends audio reply                                                   |
+------------------------------------------------------------------------------+
```

---

## Integration Points: New vs. Existing

### What Stays Unchanged

| Component | Location | Status |
|-----------|----------|--------|
| `chatService` | `server/src/services/chat.ts` | No changes; voice pipeline uses it as-is |
| `nexusSettingsService` | `server/src/services/nexus-settings.ts` | Extend schema only (add `voiceMode`, `telegramToken`) |
| `chatFileRoutes` | `server/src/routes/chat-files.ts` | `/transcribe` moves out; file upload stays |
| `usePiperTts` | `ui/src/hooks/usePiperTts.ts` | No changes; TtsButton continues using browser WASM |
| `TtsButton` | `ui/src/components/TtsButton.tsx` | Add auto-play prop only |
| SSE stream endpoint | `server/src/routes/chat.ts` | No changes; Telegram bridge calls services directly |
| DB schema | `packages/db` | No changes; voice is file/process, not a DB column |

### What Changes (MODIFIED)

| Component | Location | Change |
|-----------|----------|--------|
| `VoiceRecordButton` | `ui/src/components/VoiceRecordButton.tsx` | Add silence detection, waveform data emission, auto-send on silence |
| `ChatInput` | `ui/src/components/ChatInput.tsx` | Wire new VoiceMicButton, add voice mode prop |
| `ChatMessage` | `ui/src/components/ChatMessage.tsx` | Show voice_mode badge, show dual output collapse/expand |
| `nexusSettingsSchema` | `server/src/services/nexus-settings.ts` | Add `voiceMode` enum and `telegramToken` optional string |
| `app.ts` | `server/src/app.ts` | Register `voiceRoutes`, `telegramRoutes` |
| `createMessageSchema` | `packages/shared/src/validators/chat.ts` | Add `voiceMode: z.boolean().optional()` flag on messages |
| `ChatMessage` type | `packages/shared/src/types/chat.ts` | Add `voiceMode: boolean \| null` field |
| `chat-files.ts` | `server/src/routes/chat-files.ts` | Remove `/transcribe` handler (moved to voice.ts) |

### What Is New (NEW)

| Component | Location | Purpose |
|-----------|----------|---------|
| `voicePipelineService` | `server/src/services/voice-pipeline.ts` | Transport-agnostic STT/TTS core, used by web routes AND Telegram bridge |
| `voice.ts` (route) | `server/src/routes/voice.ts` | `POST /api/transcribe`, `POST /api/synthesize`: thin HTTP wrappers |
| `telegram.ts` (service) | `server/src/services/telegram.ts` | grammY bot init, long-poll loop, message relay, voice relay |
| `telegram.ts` (route) | `server/src/routes/telegram.ts` | `GET /api/telegram/status`, `POST /api/telegram/token` management endpoints |
| `VoiceMicButton` | `ui/src/components/VoiceMicButton.tsx` | Enhanced mic button with silence detection and waveform display |
| `WaveformDisplay` | `ui/src/components/WaveformDisplay.tsx` | Animated audio waveform bars using AnalyserNode |
| `VoiceModeToggle` | `ui/src/components/VoiceModeToggle.tsx` | Three-state toggle: text only / voice input / full voice |
| `useVoiceMode` | `ui/src/hooks/useVoiceMode.ts` | Reads/writes voice mode setting via `/api/nexus-settings` |
| `useSilenceDetection` | `ui/src/hooks/useSilenceDetection.ts` | Web Audio API AnalyserNode watching for 1.5s silence threshold |

---

## Component Boundaries

### voicePipelineService (Core)

This is the key abstraction for v1.6. Both the web HTTP route and the Telegram bridge call this service; neither knows about the other.

| Method | Input | Output | Implementation |
|--------|-------|--------|----------------|
| `transcribe(buffer, format)` | `Buffer`, `"webm"` or `"ogg"` | `Promise<string>` | Writes temp file, uses `execFile` (not `exec`) to spawn `whisper-cpp` or `whisper` CLI, reads stdout, cleans up |
| `synthesize(text, voiceId?)` | `string`, optional voiceId | `Promise<Buffer>` | Spawns `piper` CLI via `spawn`, pipes text to stdin, collects WAV stdout |
| `formatForVoice(text)` | `string` | `{ voice: string; full: string }` | Strips code blocks and markdown for voice; returns both variants |

The `transcribe` method extends the existing `/transcribe` implementation from `chat-files.ts` by adding an `ogg` format path alongside the existing `webm` path. The same cascade (whisper-cpp first, openai-whisper fallback) is preserved.
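
The cascade described above (try one CLI, fall back to the next) can be sketched as a small helper. This is a minimal illustration, not the existing handler: `runFirstAvailable` and `CliCandidate` are hypothetical names, and the actual whisper binary names and arguments are left to the caller rather than assumed here.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// One candidate CLI invocation (hypothetical type). Binary names and arguments
// are supplied by the caller; nothing here assumes specific whisper flags.
export interface CliCandidate {
  command: string;
  args: string[];
}

// Try each candidate in order and return the trimmed stdout of the first one
// that exits successfully. A missing binary (ENOENT), a non-zero exit, or a
// timeout moves on to the next candidate.
export async function runFirstAvailable(
  candidates: CliCandidate[],
  timeoutMs = 60_000,
): Promise<string> {
  const failures: string[] = [];
  for (const { command, args } of candidates) {
    try {
      // execFile takes args as an array, so no shell ever parses the temp-file path.
      const { stdout } = await execFileAsync(command, args, {
        timeout: timeoutMs,
        encoding: "utf8",
      });
      return stdout.trim();
    } catch (err) {
      failures.push(`${command}: ${(err as Error).message}`);
    }
  }
  throw new Error(`No STT backend succeeded: ${failures.join("; ")}`);
}
```

A `transcribe(buffer, format)` implementation would write the buffer to a temp file with the matching `.webm` or `.ogg` extension, pass that path in each candidate's `args`, and delete the file in a `finally` block.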

**Why a dedicated service vs. inline in routes:**
The Telegram bridge should not call the web route, because that would be a circular HTTP call within the same process. Both transports need the same logic. Extracting it to a service eliminates duplication and makes both implementations testable in isolation.

### telegram service

A thin relay, not a feature-rich bot. It:
1. Holds a single grammY `Bot` instance, initialized when `telegramToken` is set in nexus-settings
2. Routes text messages to `chatService.addMessage()`, then collects the AI response via `puterProxyService.chatStream()`
3. Routes voice messages: downloads the OGG file, calls `voicePipelineService.transcribe()`, then follows the same text path
4. If `voiceMode === "full_voice"`: calls `voicePipelineService.synthesize()`, sends audio back via `ctx.replyWithAudio()`
5. Prefixes agent name on replies: `[Agent Name]: message text`

**No per-user conversation tracking.** All Telegram messages go to a single conversation associated with the workspace, created on first use if it does not exist. This is the intentional "thin bridge" design; full sync is out of scope per PROJECT.md.

### Voice Route vs. Chat Files Route

The existing `/transcribe` endpoint lives inside `chatFileRoutes` in `chat-files.ts`. For v1.6, the endpoint moves to a dedicated `voice.ts` route. This is a path-preserving refactor: the endpoint behavior is unchanged, but the code now lives in a Nexus-specific file rather than inside a mostly-upstream file.

Moving the handler reduces merge conflict surface on future upstream rebases of `chat-files.ts`.

---

## Recommended Project Structure

```
server/src/
  app.ts                      # MODIFY: register voiceRoutes, telegramRoutes
  routes/
    chat-files.ts             # MODIFY: remove /transcribe handler (moved to voice.ts)
    voice.ts                  # NEW: POST /transcribe, POST /synthesize
    nexus-settings.ts         # MODIFY: expose voiceMode + telegramToken fields
    telegram.ts               # NEW: GET /telegram/status, POST /telegram/token
  services/
    voice-pipeline.ts         # NEW: transcribe(), synthesize(), formatForVoice()
    telegram.ts               # NEW: grammY bot lifecycle + relay logic
    nexus-settings.ts         # MODIFY: add voiceMode + telegramToken to schema

ui/src/
  components/
    VoiceMicButton.tsx        # NEW: replaces VoiceRecordButton in ChatInput
    WaveformDisplay.tsx       # NEW: animated bars from AnalyserNode data
    VoiceModeToggle.tsx       # NEW: 3-state toggle (text / voice-in / full-voice)
    VoiceRecordButton.tsx     # KEEP as-is (still used in file upload contexts)
    TtsButton.tsx             # MODIFY: add autoPlay prop
    ChatInput.tsx             # MODIFY: add VoiceModeToggle, swap in VoiceMicButton
    ChatMessage.tsx           # MODIFY: voice_mode badge + dual output expand
  hooks/
    useVoiceMode.ts           # NEW: reads/writes voiceMode setting
    useSilenceDetection.ts    # NEW: AnalyserNode silence threshold
    usePiperTts.ts            # KEEP as-is (browser-side TTS unchanged)

packages/shared/src/
  validators/chat.ts          # MODIFY: add voiceMode flag to createMessageSchema
  types/chat.ts               # MODIFY: add voiceMode field to ChatMessage
```

---

## Architectural Patterns

### Pattern 1: Transport-Agnostic Voice Service

**What:** A server service (`voicePipelineService`) owns STT and TTS logic. HTTP routes and Telegram relay both call the service; neither implements STT/TTS directly.

**When to use:** Any time two transports (web + bot) need the same capability.

**Trade-offs:** Adds one indirection layer. Worth it: eliminates duplication, makes each transport testable independently.

**Shape:**
```typescript
// server/src/services/voice-pipeline.ts
export interface VoicePipelineService {
  // Implemented with execFile (not exec): prevents shell injection, consistent with codebase pattern
  transcribe(buffer: Buffer, format: "webm" | "ogg"): Promise<string>;
  synthesize(text: string, voiceId?: string): Promise<Buffer>;
  formatForVoice(text: string): { voice: string; full: string };
}

// Factory shape only; the body is omitted here.
export declare function voicePipelineService(): VoicePipelineService;
```

The existing `/transcribe` handler in `chat-files.ts` already uses `promisify(execFile)`, which is the right model. The service wraps it with format selection (`webm` vs `ogg`) and the same whisper-cpp → openai-whisper cascade.
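
The `formatForVoice` member of this service strips code blocks and markdown so the TTS engine receives plain sentences. A minimal sketch follows; the regular expressions and the "(code omitted)" placeholder are illustrative assumptions, not the project's actual rules.

```typescript
// Hypothetical sketch of formatForVoice: returns a speech-friendly variant
// alongside the untouched original text.
export function formatForVoice(text: string): { voice: string; full: string } {
  const voice = text
    // Replace fenced code blocks with a short spoken placeholder.
    .replace(/`{3}[\s\S]*?`{3}/g, " (code omitted) ")
    // Keep the contents of inline code spans, drop the backticks.
    .replace(/`([^`]+)`/g, "$1")
    // Links: keep the label, drop the URL.
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1")
    // Strip heading, blockquote, and bullet markers at line starts.
    .replace(/^\s{0,3}(#{1,6}|>|[-*+])\s+/gm, "")
    // Drop asterisk emphasis markers (underscores are kept so snake_case survives).
    .replace(/\*{1,2}/g, "")
    // Collapse whitespace so the TTS engine receives one clean utterance.
    .replace(/\s+/g, " ")
    .trim();
  return { voice, full: text };
}
```

Returning both variants from one call lets a transport send `full` as text and pass only `voice` to `synthesize()`.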

### Pattern 2: Thin Telegram Relay

**What:** The Telegram bot is a relay, not a first-class UI. It translates Telegram message events into the same chatService calls the web UI makes, then sends the response back via Telegram.

**When to use:** Building a disposable bridge that will be replaced by a richer implementation later.

**Trade-offs:** No rich UI (no inline keyboards, no threading). Acceptable: PROJECT.md explicitly calls out "thin bridge only", and "Telegram threads/topics/inline keyboards" are out of scope.

**Shape:**
```typescript
// server/src/services/telegram.ts
import type { Bot } from "grammy"; // the service holds a single Bot internally (null until started)

export interface TelegramService {
  start(token: string): void; // idempotent, long-poll
  stop(): void;
  isRunning(): boolean;
}

// Factory shape only; `Db` is the codebase's database handle type.
export declare function telegramService(db: Db): TelegramService;
```

The bot calls `chatService(db)` and `puterProxyService(db)` directly, with no HTTP round-trip to the same server.

### Pattern 3: Voice Mode Flag on Messages

**What:** Each message carries an optional `voiceMode: boolean` flag. When `true`, the server formats the response for voice (dual output: `voice` + `full`), and the client auto-plays TTS and shows the full text in a collapsible block. This per-message flag is separate from the workspace-level `voiceMode` setting, which is a three-state enum.

**When to use:** Differentiating voice-initiated messages from text messages within the same conversation.

**Trade-offs:** Adds a field to `createMessageSchema` and the `ChatMessage` type. The field is optional and treated as `false` when absent, so existing messages and the upstream schema are not broken.

**Schema change:**
```typescript
// packages/shared/src/validators/chat.ts (additive only)
import { z } from "zod";

export const createMessageSchema = z.object({
  role: z.enum(["user", "assistant", "system"]),
  content: z.string().min(1).max(100_000),
  agentId: z.string().uuid().optional(),
  messageType: z.string().optional(),
  voiceMode: z.boolean().optional(), // NEW in v1.6
});
```

### Pattern 4: Direct Service Calls in Telegram Bridge

**What:** The Telegram bot does not call the Express HTTP API to get AI responses. It calls `chatService(db)` and `puterProxyService(db)` as regular TypeScript function calls within the same server process.

**When to use:** Any time a server-side integration needs the same AI response capability as the web UI without an HTTP round-trip.

**Trade-offs:** Telegram handler and web handler share the same in-process service instances. If chatService has connection pooling issues, both paths are affected. This is acceptable for a single-user deployment that shares one DB connection pool.

**Why not HTTP:** A `fetch("http://localhost:PORT/api/...")` call from within the same server requires auth token injection, port discovery, and creates circular request chains that are hard to test and fragile in development.

### Pattern 5: grammY Long-Poll for Single-User Local Deployment

**What:** Use grammY `bot.start()` (long polling) rather than webhooks. The bot polls Telegram for new messages continuously while the server is running.

**When to use:** Local single-user deployments where a public HTTPS endpoint is not available. No reverse proxy needed, no SSL cert, no domain.

**Trade-offs:** Long polling is slightly less efficient than webhooks (Telegram must respond to each poll request) but functionally equivalent for <5,000 messages/hour. Fine for personal use.

**Lifecycle:**
- Start: `nexusSettingsService().get()` finds `telegramToken` set → `telegramService(db).start(token)`
- Stop: `server.close()` → `telegramService(db).stop()`
- Runtime toggle: `POST /api/telegram/token` updates nexus-settings and calls start/stop
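
The lifecycle above can be sketched as a small state holder that makes start idempotent and handles token changes. This is an illustration under assumptions: `createTelegramLifecycle`, `BotLike`, and the `makeBot` factory are hypothetical names. The only grammY behavior relied on is that `bot.start()` returns a promise that settles when polling ends and that `bot.stop()` ends polling.

```typescript
// Minimal stand-in for the parts of a bot this sketch needs (hypothetical type).
// A grammY Bot satisfies it; a fake can be substituted in tests.
export interface BotLike {
  start(): Promise<void> | void;
  stop(): Promise<void> | void;
}

export function createTelegramLifecycle(makeBot: (token: string) => BotLike) {
  let bot: BotLike | null = null;
  let activeToken: string | null = null;

  async function stop(): Promise<void> {
    if (!bot) return;
    const current = bot;
    bot = null;
    activeToken = null;
    await current.stop();
  }

  // Idempotent: the same token is a no-op, a new token restarts the bot,
  // and a missing token stops it.
  async function applyToken(token: string | undefined): Promise<void> {
    if (!token) return stop();
    if (bot && token === activeToken) return;
    await stop();
    bot = makeBot(token);
    activeToken = token;
    // Long polling runs until stop(); awaiting it here would block server startup.
    void Promise.resolve(bot.start()).catch((err) => {
      console.error("telegram polling stopped:", err);
    });
  }

  return { applyToken, stop, isRunning: () => bot !== null };
}
```

Server boot would call `applyToken(settings.telegramToken)`, the token endpoint would call it again with the new value, and the shutdown hook would call `stop()`.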

---

## Data Flow

### Web Voice Input Flow

```
User holds mic button
    |
    v
VoiceMicButton: MediaRecorder + AnalyserNode
    |
    v (silence detected after 1.5s or stop pressed)
POST /api/transcribe {audio: webm blob}
    |
    v
voice.ts route -> voicePipelineService.transcribe(buffer, "webm")
    |
    v (whisper-cpp or openai-whisper CLI via execFile)
{ text: "transcribed text" }
    |
    v
ChatInput fills textarea -> user sends (message tagged voiceMode: true)
    |
    v
POST /conversations/:id/stream -> chatService + puterProxyService
    |
    v (SSE tokens arrive)
ChatMessage with voice_mode badge + dual output (voice text + full text collapsible)
    |
    v
TtsButton auto-plays (browser-side piper-tts-web WASM, unchanged from v1.5)
```
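
The silence-detection step in this flow can be separated from the Web Audio plumbing as a pure function over sample frames. The sketch below is an assumption about how `useSilenceDetection` might decide: `createSilenceTracker` is a hypothetical name and the 0.01 RMS threshold is a placeholder that would need tuning against real microphone input.

```typescript
// Root-mean-square level of one frame of time-domain samples in [-1, 1],
// for example the output of AnalyserNode.getFloatTimeDomainData().
export function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

// Tracks how long the input has stayed below a loudness threshold.
// thresholdRms = 0.01 is an assumed starting point, not a measured value.
export function createSilenceTracker(silenceMs = 1500, thresholdRms = 0.01) {
  let quietSince: number | null = null;
  let heardSpeech = false;

  // Feed one frame with its timestamp. Returns true once speech has been heard
  // and the signal has then stayed quiet for at least silenceMs.
  function push(frame: Float32Array, nowMs: number): boolean {
    if (rms(frame) >= thresholdRms) {
      heardSpeech = true;
      quietSince = null;
      return false;
    }
    if (quietSince === null) quietSince = nowMs;
    return heardSpeech && nowMs - quietSince >= silenceMs;
  }

  return { push };
}
```

Requiring speech before the timer counts avoids auto-sending an empty recording when the user pauses before talking. In the hook, `push` would be called from a `requestAnimationFrame` loop with `performance.now()` as the timestamp.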

### Server-Side TTS Flow (POST /synthesize)

```
POST /api/synthesize { text, voiceId? }
    |
    v
voice.ts route -> voicePipelineService.synthesize(text)
    |
    v (piper CLI via spawn: text -> stdin, WAV bytes <- stdout)
Response: Content-Type audio/wav, Buffer body
    |
    v
Client: new Audio(URL.createObjectURL(blob)).play()
```

Note: Server-side `/synthesize` is new in v1.6. Its primary consumer is the Telegram bridge (which cannot use browser WASM). Web chat continues using browser-side `usePiperTts` WASM (v1.5 unchanged). The route is available for headless/server scenarios going forward.
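
The spawn step (text to stdin, audio bytes from stdout) can be sketched as a generic helper. `pipeThroughCli` is a hypothetical name and the 8-second default is an assumption; the piper invocation shown in the trailing comment is the CLI form cited under Sources and has not been verified against a specific piper build.

```typescript
import { spawn } from "node:child_process";

// Pipe `input` to a child process's stdin and collect its stdout as a Buffer.
// Rejects on spawn failure (for example a binary that is not installed),
// on a non-zero exit code, or when the timeout elapses.
export function pipeThroughCli(
  command: string,
  args: string[],
  input: string,
  timeoutMs = 8_000,
): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: ["pipe", "pipe", "ignore"] });
    const chunks: Buffer[] = [];
    const timer = setTimeout(() => {
      child.kill(); // stop a hung process instead of leaking it
      reject(new Error(`${command} timed out after ${timeoutMs} ms`));
    }, timeoutMs);

    child.stdout.on("data", (chunk: Buffer) => chunks.push(chunk));
    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.on("close", (code) => {
      clearTimeout(timer);
      if (code === 0) resolve(Buffer.concat(chunks));
      else reject(new Error(`${command} exited with code ${code}`));
    });
    child.stdin.on("error", () => {}); // ignore EPIPE if the child exits early
    child.stdin.end(input);
  });
}

// Assumed usage inside synthesize(), with a placeholder model path:
//   pipeThroughCli("piper", ["--model", "voice.onnx", "-f", "-"], text)
```

Killing the child on timeout matters here: a rejected promise alone would leave a hung process running in the background.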

### Telegram Text Message Flow

```
Telegram user sends text
    |
    v
grammY bot.on("message:text") handler
    |
    v
telegramService: resolveOrCreateConversation(db)
    |
    v
chatService(db).addMessage(conversationId, { role: "user", content: text })
    |
    v
telegramService: collect full response via puterProxyService(db).chatStream()
    |
    v (if voiceMode !== "full_voice")
ctx.reply("[AgentName]: full_response_text")

    | (if voiceMode === "full_voice")
    v
voicePipelineService.formatForVoice(response) -> { voice, full }
ctx.reply("[AgentName]: " + full)  -- text message with full details
    |
    v
voicePipelineService.synthesize(voice) -> WAV Buffer
ctx.replyWithAudio(new InputFile(wavBuffer, "reply.wav"))
```
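
The branch at the end of this flow can be expressed as a pure planning function that decides what to send before any Telegram call is made. `planTelegramReply` and `TelegramReplyPlan` are hypothetical names; the formatter is injected so the sketch does not depend on a particular `formatForVoice` implementation.

```typescript
type VoiceMode = "text" | "voice_input" | "full_voice";

// What the relay should send for one AI response. `voiceText` is present only
// when a spoken reply should also be synthesized.
export interface TelegramReplyPlan {
  text: string;
  voiceText?: string;
}

export function planTelegramReply(
  voiceMode: VoiceMode,
  agentName: string,
  response: string,
  formatForVoice: (text: string) => { voice: string; full: string },
): TelegramReplyPlan {
  // Text-only modes: one prefixed text message, no audio.
  if (voiceMode !== "full_voice") {
    return { text: `[${agentName}]: ${response}` };
  }
  // Full voice: full details go out as text, the stripped variant is spoken.
  const { voice, full } = formatForVoice(response);
  return { text: `[${agentName}]: ${full}`, voiceText: voice };
}
```

The handler would send `plan.text` with `ctx.reply()` and, when `plan.voiceText` is set, synthesize it and send the audio afterwards. Keeping the decision separate from the grammY context makes it testable without a bot token.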

### Telegram Voice Message Flow

```
Telegram user sends voice note (OGG Opus format)
    |
    v
grammY bot.on("message:voice") -> ctx.getFile() -> download Buffer
    |
    v
voicePipelineService.transcribe(buffer, "ogg") -> whisper CLI -> text
    |
    v
(same path as Telegram text message above)
```

### nexus-settings Schema Evolution

```
v1.5: { mode, voiceEnabled }
v1.6: { mode, voiceEnabled, voiceMode, telegramToken }

voiceMode: "text" | "voice_input" | "full_voice"  (default: "text")
telegramToken: string | undefined  (set by user via UI or POST /telegram/token)
```

`voiceMode` is a workspace-level setting (not per-agent). The three states map to:
- `"text"`: mic button transcribes to text input, TTS manual-only, Telegram text-only
- `"voice_input"`: mic transcribes and auto-sends, TTS manual-only, Telegram voice-in + text-out
- `"full_voice"`: mic auto-sends, TTS auto-plays on every response, Telegram voice-in + voice-out
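
The three states above can be written down as a lookup table so that UI and Telegram code read behavior from one place. This is a sketch: `parseVoiceMode`, `VoiceBehavior`, and `VOICE_BEHAVIOR` are hypothetical names, and the boolean fields are one reading of the list above rather than an existing type.

```typescript
export const VOICE_MODES = ["text", "voice_input", "full_voice"] as const;
export type VoiceMode = (typeof VOICE_MODES)[number];

// Fall back to the documented default when the stored value is missing or
// unrecognized (for example a v1.5 settings file with no voiceMode key).
export function parseVoiceMode(raw: unknown): VoiceMode {
  const known = VOICE_MODES as readonly string[];
  return typeof raw === "string" && known.indexOf(raw) !== -1
    ? (raw as VoiceMode)
    : "text";
}

export interface VoiceBehavior {
  micAutoSend: boolean;      // send the transcript without a manual confirm
  ttsAutoPlay: boolean;      // play TTS on every response
  telegramVoiceIn: boolean;  // accept Telegram voice notes
  telegramVoiceOut: boolean; // reply on Telegram with synthesized audio
}

export const VOICE_BEHAVIOR: Record<VoiceMode, VoiceBehavior> = {
  text: { micAutoSend: false, ttsAutoPlay: false, telegramVoiceIn: false, telegramVoiceOut: false },
  voice_input: { micAutoSend: true, ttsAutoPlay: false, telegramVoiceIn: true, telegramVoiceOut: false },
  full_voice: { micAutoSend: true, ttsAutoPlay: true, telegramVoiceIn: true, telegramVoiceOut: true },
};
```

With this table, a caller writes `VOICE_BEHAVIOR[parseVoiceMode(settings.voiceMode)].ttsAutoPlay` instead of comparing mode strings at each call site.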

---

## Scaling Considerations

This system targets a single user on Mac Mini M4 throughout its lifetime. Scaling is not a concern. The architecture is optimized for simplicity and upstream merge compatibility.

| Concern | At 1 user (target) | Notes |
|---------|-------------------|-------|
| STT latency | whisper-cpp base.en on M4: ~1-3s | Acceptable; shows transcribing spinner |
| TTS latency | piper CLI on M4: ~0.3-1s for short text | <3s target met |
| Telegram poll | grammY `bot.start()`, 1 process | Adequate for <5,000 msgs/hour |
| Memory overhead | ~10-20MB for polling loop | Acceptable on 16GB+ M4 |
| Piper model | First server-side synthesize: cold start | Piper loads model into memory; subsequent calls fast |

---

## Anti-Patterns

### Anti-Pattern 1: Telegram-Specific Voice Logic

**What people do:** Implement OGG-to-text and text-to-OGG directly inside the Telegram bot handler.

**Why it's wrong:** Creates two separate STT/TTS code paths that diverge over time. Voice bugs must be fixed in two places. Untestable in isolation.

**Do this instead:** All voice processing goes through `voicePipelineService`. The Telegram handler calls `transcribe(buf, "ogg")` and the service handles format differences. The web route calls `transcribe(buf, "webm")`: same service, different format argument.

### Anti-Pattern 2: Circular HTTP Call for Telegram AI Response

**What people do:** Telegram bot handler calls `fetch("http://localhost:PORT/api/conversations/:id/stream")` to get AI responses from within the same server process.

**Why it's wrong:** Requires auth token injection. Fragile (port discovery). Extra TCP round-trip. Fails in test environments where the HTTP server may not be running.

**Do this instead:** `telegramService` imports `chatService(db)` and `puterProxyService(db)` directly. Collect tokens from the async generator into a string, then send to Telegram as a single message.
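
Collecting the streamed tokens into one string is a few lines. The sketch below assumes the stream is an `AsyncIterable<string>`; the exact shape returned by `puterProxyService(db).chatStream()` is not shown in this document, so that is an assumption. `splitForTelegram` is a hypothetical helper that accounts for Telegram's 4096-character limit on a single text message.

```typescript
// Concatenate every token from a streaming response into one string.
export async function collectStream(tokens: AsyncIterable<string>): Promise<string> {
  let text = "";
  for await (const token of tokens) text += token;
  return text;
}

// Split a long reply into chunks that each fit in one Telegram text message.
// Naive fixed-width split: it may cut mid-word, which is acceptable for a sketch.
export function splitForTelegram(text: string, limit = 4096): string[] {
  const parts: string[] = [];
  for (let i = 0; i < text.length; i += limit) {
    parts.push(text.slice(i, i + limit));
  }
  return parts;
}
```

A handler would await `collectStream(...)`, then call `ctx.reply()` once per chunk. Most replies fit in a single chunk, so the common case is still one message.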

### Anti-Pattern 3: Blocking grammY on Slow CLI Processes

**What people do:** `await synthesize()` inside a bot handler with no timeout, assuming piper is always available and fast.

**Why it's wrong:** If the `piper` binary is not installed or hangs, the grammY update queue stalls. The same update gets retried indefinitely.

**Do this instead:** Wrap CLI calls in a `Promise.race([piperCall, timeout(8_000)])`. If piper times out or is not installed, fall back to text-only reply and log the failure. Bot degrades gracefully to text mode.
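
The race-with-timeout and text fallback can be sketched as two small helpers. `withTimeout` and `synthesizeOrNull` are hypothetical names; the synthesize function is injected so the sketch stays independent of the piper integration.

```typescript
// Reject if `work` has not settled within `ms`. The timer is cleared either
// way so it cannot keep the event loop alive after the race is decided.
export function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Try to produce audio. On timeout or failure, log and return null so the
// caller can send a text-only reply instead.
export async function synthesizeOrNull(
  synthesize: (text: string) => Promise<Buffer>,
  text: string,
  ms = 8_000,
): Promise<Buffer | null> {
  try {
    return await withTimeout(synthesize(text), ms);
  } catch (err) {
    console.warn("voice reply skipped:", (err as Error).message);
    return null;
  }
}
```

Note that `Promise.race` only stops waiting; it does not stop the underlying process. The code that spawns piper should also kill the child when its own timeout fires, otherwise hung processes accumulate.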

### Anti-Pattern 4: Keeping /transcribe Inside chat-files.ts

**What people do:** Leave the STT handler in `chat-files.ts` and call `voicePipelineService` from there, adding Nexus-specific logic to an upstream-sourced file.

**Why it's wrong:** `chat-files.ts` is a mostly-upstream Paperclip file. Each rebase introduces merge conflicts. More Nexus-specific code in the file = more conflict surface.

**Do this instead:** Move `/transcribe` and `/synthesize` to a new `voice.ts` route file (Nexus-only, never in upstream). Keep `chat-files.ts` as close to upstream as possible.

### Anti-Pattern 5: Storing Telegram Token in Database

**What people do:** Create a new DB table or add a column to `instance_settings` to store the Telegram bot token.

**Why it's wrong:** Any DB schema change blocks upstream rebase (migration files conflict). The `nexus-settings.json` file-backed service is the established Nexus pattern for project-specific config that has no upstream equivalent.

**Do this instead:** Store `telegramToken` in `nexus-settings.json` via the existing `nexusSettingsService`. Same pattern as `voiceEnabled`, `mode`.

---

## Integration Points

### External Services

| Service | Integration Pattern | Notes |
|---------|---------------------|-------|
| Telegram Bot API | grammY `bot.start()` long-polling (Node.js) | No public URL required; polling starts on server boot if token present in nexus-settings |
| whisper-cpp / openai-whisper | `execFile` cascade (same as existing `/transcribe`) | Format argument added: writes `.webm` or `.ogg` temp file based on input |
| piper TTS binary | `child_process.spawn` stdin -> stdout | Text piped to stdin; WAV or raw PCM bytes collected from stdout |

### Internal Boundaries

| Boundary | Communication | Notes |
|----------|---------------|-------|
| voice route <-> voicePipelineService | Direct function call | Route is thin HTTP wrapper; all logic in service |
| telegram service <-> voicePipelineService | Direct function call | Same service used by both transports |
| telegram service <-> chatService | Direct function call | Bot calls `chatService(db)` directly, with no HTTP round-trip |
| telegram service <-> nexusSettingsService | Direct function call | Reads `voiceMode` and `telegramToken` at start and on each message |
| web UI <-> voice route | REST: `POST /api/transcribe`, `POST /api/synthesize` | Web client uses browser-side piper WASM for TTS; `/synthesize` primarily for Telegram |
| UI VoiceModeToggle <-> nexus-settings | REST: `PATCH /api/nexus-settings` | Reads/writes `voiceMode` setting |

---

## Build Order

Based on component dependencies, the recommended build order within this milestone:

| Step | Component(s) | Reason |
|------|-------------|--------|
| 1 | `nexus-settings` schema extensions (`voiceMode`, `telegramToken`) | Everything downstream reads settings |
| 2 | `voicePipelineService` | Backs all voice features. No new deps. Independently testable. |
| 3 | `voice.ts` route (`POST /transcribe`, `POST /synthesize`) | Thin wrapper. Register in `app.ts`. Move handler from chat-files. |
| 4 | `createMessageSchema` + `ChatMessage` type (`voiceMode` flag) | Shared package change. Required by Steps 6-7, so it lands before them. |
| 5 | `VoiceMicButton` + `WaveformDisplay` + `useSilenceDetection` | Pure UI. Depends only on `/transcribe` (Step 3). |
| 6 | `VoiceModeToggle` + `useVoiceMode` | Depends on `voiceMode` in nexus-settings schema (Step 1). |
| 7 | `ChatMessage` dual output | Depends on `voiceMode` in shared `ChatMessage` type (Step 4). |
| 8 | `telegramService` | Depends on voicePipelineService (2), chatService (existing), nexusSettings (1). |
| 9 | `telegram.ts` route + app.ts registration | Management endpoints. Needs telegramService. |
| 10 | Onboarding STT/TTS hardware detection step | Final: wires all voice detection into onboarding flow. |

Steps 5-7 (web UI) can run in parallel with Steps 8-9 (Telegram) if split across phases.

---

## Sources

- Direct codebase inspection: `server/src/routes/chat-files.ts` (lines 297-386), `server/src/routes/chat.ts`, `server/src/services/nexus-settings.ts`, `server/src/app.ts`, `ui/src/components/VoiceRecordButton.tsx`, `ui/src/components/TtsButton.tsx`, `ui/src/hooks/usePiperTts.ts`, `packages/shared/src/validators/chat.ts`, `packages/shared/src/types/chat.ts`
- `.planning/STATE.md`: v1.6 architectural decisions (transport-agnostic, disposable bridge, dual output, per-message flag)
- `.planning/milestones/v1.5-phases/34-voice/34-RESEARCH.md`: existing voice implementation details, WASM TTS pattern
- [grammY documentation](https://grammy.dev/): TypeScript-native, Bot API 9.6 (April 2026), long-polling vs webhooks
- [grammY deployment types guide](https://grammy.dev/guide/deployment-types): long polling recommended for single-user local; Express integration pattern
- [rhasspy/piper (archived)](https://github.com/rhasspy/piper): CLI `echo "text" | piper --model voice.onnx -f -`; development moved to OHF-Voice/piper1-gpl Oct 2025
- grammY supports Telegram Bot API 9.6 (released April 3, 2026); latest version confirmed

---
*Architecture research for: Voice Pipeline + Minimal Telegram Bridge (v1.6)*
*Researched: 2026-04-03*