# Technology Stack: v1.6 Voice Pipeline + Telegram Bridge

**Project:** Nexus v1.6, additive to the v1.5 stack (see prior STACK.md for hardware detection, smart-whisper, Puter.js, vectra, openid-client)
**Researched:** 2026-04-03
**Scope:** NEW libraries only for v1.6: server-side voice pipeline integration, audio format conversion, browser VAD, Telegram bridge
**Confidence:** MEDIUM-HIGH (grammy HIGH via official docs; vad-react MEDIUM, React 19 peer dep confirmed fixed; ffmpeg-static MEDIUM, archived fluent-ffmpeg confirmed, spawn approach verified)

---

## Context: What v1.5 Already Installed

Do not re-add or re-research these. They are in `server/package.json` or `ui/package.json`:

| Package | Location | Purpose |
|---------|----------|---------|
| `smart-whisper ^0.8.1` | `server/` | Whisper.cpp Node bindings (recommended in v1.5 STACK.md) |
| `@mintplex-labs/piper-tts-web ^1.0.4` | `ui/` | Browser-side Piper WASM (already installed) |
| `systeminformation 5` | `server/` | Hardware detection |
| `multer ^2.0.2` | `server/` | Multipart upload (already handles audio blob uploads) |
| `express ^5.1.0` | `server/` | HTTP server |

The existing `VoiceRecordButton` already uses `MediaRecorder` + `POST /api/transcribe`. The existing `usePiperTts` hook already uses `@mintplex-labs/piper-tts-web` for browser-side TTS. The v1.6 work **extends** this by adding silence detection, server-side TTS, and Telegram relay.

---

## New Libraries by Feature Area

### 1. Browser VAD (Silence Detection + Auto-Send)

**Package:** `@ricky0123/vad-react`
**Version:** `^0.0.36`
**Where it lives:** `ui/` only (browser-side ONNX model running off the main thread)

**Why:** The existing `VoiceRecordButton` requires the user to manually tap Stop. `@ricky0123/vad-react` uses Silero VAD (ONNX Runtime Web) to detect when the user stops speaking and fires `onSpeechEnd` automatically with the speech segment as a `Float32Array` at 16kHz.
This eliminates the manual stop button and enables waveform-while-speaking UI via the `userSpeaking` state flag.

**React 19 compatibility:** Confirmed fixed in v0.0.36 (August 2025). The peer dependency constraint on React 18 was resolved. No `--legacy-peer-deps` needed.

**API surface:**

```typescript
import { useMicVAD } from "@ricky0123/vad-react";

// useMicVAD is a React hook: call it inside a component.
// sendToTranscribeEndpoint and float32ToWav are app-level helpers, not part of vad-react.
const vad = useMicVAD({
  startOnLoad: false,            // user must explicitly start
  positiveSpeechThreshold: 0.3,  // sensitivity
  minSpeechMs: 400,              // ignore sub-400ms blips
  redemptionMs: 1400,            // 1.4s silence = end of utterance
  onSpeechEnd: (audio: Float32Array) => {
    // audio is a 16kHz Float32Array, which matches what Whisper expects
    sendToTranscribeEndpoint(float32ToWav(audio));
  },
});

// vad.userSpeaking: boolean for waveform animation
// vad.listening: boolean for mic state
// vad.start() / vad.pause()
```

**Key integration note:** `onSpeechEnd` delivers a `Float32Array` at 16000Hz. This maps directly to what `smart-whisper` expects on the server side, so no resampling is needed in the browser-to-server path.

**Confidence: MEDIUM.** Version verified via GitHub issues, React 19 fix confirmed. ONNX Runtime Web dependency means an extra ~5MB WASM download on first load.

---

### 2. Audio Format Conversion (Server-Side: WebM → WAV, WAV → OGG)

**Package:** `ffmpeg-static`
**Version:** `^5.2.0` (bundles FFmpeg 6.1.1 binaries for macOS arm64 + x64, Linux, Windows)
**Where it lives:** `server/`; provides the binary path, invoked via Node.js `child_process.spawn`

**Why `ffmpeg-static` over alternatives:**

- `fluent-ffmpeg`: archived on GitHub May 2025 and no longer maintained; do NOT use as a new dependency
- `@ffmpeg-installer/ffmpeg`: last updated 2022, stale binary (FFmpeg 4.x)
- `ffmpeg-static`: actively maintained, ships FFmpeg 6.1.1, macOS arm64 confirmed, installed as an npm dependency (no system-level install needed)
- Calling `child_process.spawn(ffmpegPath, [...])` directly, with the binary path exported by `ffmpeg-static`, is the recommended approach for 2025+

**Two conversions needed:**

**a) Incoming STT path: WebM/Opus → WAV 16kHz mono (for Whisper)**

```typescript
import ffmpegPath from "ffmpeg-static";
import { spawn } from "node:child_process";

function webmToWav16k(inputBuffer: Buffer): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const proc = spawn(ffmpegPath!, [
      "-i", "pipe:0",          // read from stdin
      "-acodec", "pcm_s16le",
      "-ac", "1",              // mono
      "-ar", "16000",          // 16kHz
      "-f", "wav",
      "pipe:1",                // write to stdout
    ]);
    const out: Buffer[] = [];
    proc.stdout.on("data", (c: Buffer) => out.push(c));
    proc.stderr.on("data", () => {});   // drain and discard ffmpeg banner/log output
    proc.stdin.on("error", () => {});   // ignore EPIPE if ffmpeg exits before reading all input
    proc.on("error", reject);           // spawn failure (e.g. binary missing)
    // "close" fires once the process has exited and its stdio streams have ended
    proc.on("close", (code) =>
      code === 0
        ? resolve(Buffer.concat(out))
        : reject(new Error(`ffmpeg exited with code ${code}`)),
    );
    proc.stdin.end(inputBuffer);
  });
}
```

**b) Outgoing Telegram TTS path: WAV/PCM → OGG Opus (Telegram voice format)**

```typescript
function wavToOggOpus(inputBuffer: Buffer): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const proc = spawn(ffmpegPath!, [
      "-i", "pipe:0",
      "-c:a", "libopus",
      "-b:a", "32k",
      "-f", "ogg",
      "pipe:1",
    ]);
    // Same stdin/stdout pattern as webmToWav16k above
    const out: Buffer[] = [];
    proc.stdout.on("data", (c: Buffer) => out.push(c));
    proc.stderr.on("data", () => {});
    proc.stdin.on("error", () => {});
    proc.on("error", reject);
    proc.on("close", (code) =>
      code === 0
        ? resolve(Buffer.concat(out))
        : reject(new Error(`ffmpeg exited with code ${code}`)),
    );
    proc.stdin.end(inputBuffer);
  });
}
```

**Confidence: MEDIUM.** `ffmpeg-static` macOS arm64 confirmed via GitHub README. Pipe-based approach is well-documented.
fluent-ffmpeg archival confirmed May 2025.

---

### 3. Telegram Bridge

**Package:** `grammy`
**Version:** `^1.41.1` (latest, supports Bot API 9.6)
**Where it lives:** `server/` as an optional singleton service; only starts if `TELEGRAM_BOT_TOKEN` is set

**Why grammy over alternatives:**

- `grammy` has 1.4M weekly downloads vs `telegraf` at 900K; grammY is now the higher-adoption choice
- grammY is TypeScript-first (clean types, no DefinitelyTyped). Telegraf v4 migrated to TS but the type system is described as "too complex to understand" in grammY's own comparison docs
- `node-telegram-bot-api` is lower-level with no middleware and requires more boilerplate for this use case
- grammY's file handling API (`ctx.getFile()`) is the cleanest for the voice relay use case

**What the bridge needs to do (thin relay only, per PROJECT.md):**

```typescript
import { Bot } from "grammy";

const bot = new Bot(process.env.TELEGRAM_BOT_TOKEN!);

// Relay text messages to Nexus chat API
bot.on("message:text", async (ctx) => {
  const response = await relayToNexus(ctx.message.text, ctx.from.id);
  await ctx.reply(response);
});

// Receive voice messages: download OGG, transcribe, relay
bot.on("message:voice", async (ctx) => {
  const file = await ctx.getFile();
  // Core grammY returns file metadata only. downloadFile is an app-level helper that
  // fetches file_path from the Bot API file endpoint. (The @grammyjs/files plugin adds
  // file.download(), but it saves to a temp file and returns the path, not a Buffer.)
  const oggBuffer = await downloadFile(file.file_path!, bot.token);
  const transcript = await transcribeOgg(oggBuffer); // via smart-whisper
  const response = await relayToNexus(transcript, ctx.from.id);
  await ctx.reply(response);
});

// Run with long polling (no webhook needed for single-user local setup)
bot.start();
```

**Voice message format from Telegram:** Telegram sends voice messages as OGG/Opus, 32kbps, mono, 48kHz. To pass this to Whisper (which needs 16kHz WAV), convert with the `ffmpeg-static` pipeline: `ogg→wav16k`.
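
The bridge snippet calls `downloadFile` without defining it. The following is a minimal sketch of what it could look like, assuming Node 18+ (global `fetch`) and the Bot API file URL format `https://api.telegram.org/file/bot<token>/<file_path>`. The names `telegramFileUrl` and `oggToWav16kArgs` are hypothetical, introduced here only for illustration; the argument list mirrors the one used in `webmToWav16k`, since ffmpeg detects the OGG container from the input stream.

```typescript
// Hypothetical helpers: only the URL format comes from the Telegram Bot API.
const TELEGRAM_FILE_BASE = "https://api.telegram.org/file";

// Build the download URL for a file_path returned by getFile.
function telegramFileUrl(filePath: string, token: string): string {
  return `${TELEGRAM_FILE_BASE}/bot${token}/${filePath}`;
}

// Fetch the voice file into memory. Assumes Node 18+ for global fetch.
async function downloadFile(filePath: string, token: string): Promise<Buffer> {
  const res = await fetch(telegramFileUrl(filePath, token));
  if (!res.ok) {
    throw new Error(`Telegram file download failed: HTTP ${res.status}`);
  }
  return Buffer.from(await res.arrayBuffer());
}

// ffmpeg arguments for OGG/Opus (48kHz) -> WAV 16kHz mono, stdin to stdout.
function oggToWav16kArgs(): string[] {
  return [
    "-i", "pipe:0",
    "-acodec", "pcm_s16le",
    "-ac", "1",
    "-ar", "16000",
    "-f", "wav",
    "pipe:1",
  ];
}
```

Passing `oggToWav16kArgs()` to `spawn(ffmpegPath, ...)` with the same stdin/stdout handling as `webmToWav16k` would give a `transcribeOgg` input in the format Whisper needs. Keeping the URL and argument builders as pure functions makes them easy to unit test without network access or an ffmpeg binary.
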

**To send TTS back to Telegram:** Convert Piper WAV output → OGG Opus via `ffmpeg-static`, then use `ctx.replyWithVoice(new InputFile(oggBuffer, "voice.ogg"))`.

**Long polling vs webhook:** Long polling is correct for this deployment (Mac Mini, local network, no public HTTPS endpoint required). No reverse proxy or SSL cert needed.

**Confidence: HIGH.** grammy official docs verified at grammy.dev. File download pattern confirmed via grammY file handling guide. Bot API 9.6 support confirmed in homepage badge.

---

### 4. Server-Side Piper TTS (Audio Response Endpoint)

**No new library needed.** The v1.5 STACK.md already specified the `child_process.spawn` approach with the Piper binary.

**What v1.6 adds on top of v1.5:**

- A new Express route: `POST /api/voice/synthesize` that accepts `{ text, voice? }` and returns raw WAV audio (`Content-Type: audio/wav`)
- This endpoint is used by both the web chat playback (browser `