Technology Stack: v1.6 Voice Pipeline + Telegram Bridge
Project: Nexus v1.6, additive to the v1.5 stack (see prior STACK.md for hardware detection, smart-whisper, Puter.js, vectra, openid-client)
Researched: 2026-04-03
Scope: NEW libraries only for v1.6: server-side voice pipeline integration, audio format conversion, browser VAD, Telegram bridge
Confidence: MEDIUM-HIGH (grammy HIGH via official docs; vad-react MEDIUM, React 19 peer dep confirmed fixed; ffmpeg-static MEDIUM, fluent-ffmpeg archival confirmed and spawn approach verified)
Context: What v1.5 Already Installed
Do not re-add or re-research these — they are in server/package.json or ui/package.json:
| Package | Location | Purpose |
|---|---|---|
| smart-whisper ^0.8.1 | server/ | Whisper.cpp Node bindings (recommended in v1.5 STACK.md) |
| @mintplex-labs/piper-tts-web ^1.0.4 | ui/ | Browser-side Piper WASM (already installed) |
| systeminformation 5 | server/ | Hardware detection |
| multer ^2.0.2 | server/ | Multipart upload (already handles audio blob uploads) |
| express ^5.1.0 | server/ | HTTP server |
The existing VoiceRecordButton already uses MediaRecorder + POST /api/transcribe. The existing usePiperTts hook already uses @mintplex-labs/piper-tts-web for browser-side TTS. The v1.6 work extends this — adding silence detection, server-side TTS, and Telegram relay.
New Libraries by Feature Area
1. Browser VAD (Silence Detection + Auto-Send)
Package: @ricky0123/vad-react
Version: ^0.0.36
Where it lives: ui/ only — browser-side ONNX model running off the main thread
Why: The existing VoiceRecordButton requires the user to manually tap Stop. @ricky0123/vad-react uses Silero VAD (ONNX Runtime Web) to detect when the user stops speaking and fires onSpeechEnd automatically with the speech segment as a Float32Array at 16kHz. This eliminates the manual stop button and enables waveform-while-speaking UI via the userSpeaking state flag.
React 19 compatibility: Confirmed fixed in v0.0.36 (August 2025). The peer dependency constraint on React 18 was resolved. No --legacy-peer-deps needed.
API surface:

    import { useMicVAD } from "@ricky0123/vad-react";

    const vad = useMicVAD({
      startOnLoad: false,            // user must explicitly start
      positiveSpeechThreshold: 0.3,  // sensitivity
      minSpeechMs: 400,              // ignore sub-400ms blips
      redemptionMs: 1400,            // 1.4s silence = end of utterance
      onSpeechEnd: (audio: Float32Array) => {
        // audio is a 16kHz Float32Array, which matches what Whisper expects
        sendToTranscribeEndpoint(float32ToWav(audio));
      },
    });

    // vad.userSpeaking: boolean for waveform animation
    // vad.listening: boolean for mic state
    // vad.start() / vad.pause()

Key integration note: onSpeechEnd delivers a Float32Array at 16000Hz — this maps directly to what smart-whisper expects on the server side, so no resampling is needed in the browser-to-server path.
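The `float32ToWav` call in the snippet above is not defined anywhere in this document. A minimal sketch of what such a helper could look like follows, assuming 16-bit mono PCM output at the 16kHz rate the VAD delivers. The name `encodeWavPcm16` and the choice to return a `Blob` are illustrative assumptions, not part of vad-react.

```typescript
// Hypothetical helper: encode mono Float32 samples (range -1..1) as a 16-bit PCM WAV file.
export function encodeWavPcm16(samples: Float32Array, sampleRate = 16000): ArrayBuffer {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte canonical WAV header + PCM data
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);                 // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                           // fmt chunk size
  view.setUint16(20, 1, true);                            // audio format 1 = PCM
  view.setUint16(22, 1, true);                            // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true);  // byte rate
  view.setUint16(32, bytesPerSample, true);               // block align
  view.setUint16(34, 16, true);                           // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));      // clamp to avoid integer wrap-around
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

// Wrap the encoded bytes in a Blob so they can be appended to FormData for upload.
export function float32ToWav(samples: Float32Array, sampleRate = 16000): Blob {
  return new Blob([encodeWavPcm16(samples, sampleRate)], { type: "audio/wav" });
}
```

If the upload already arrives as 16kHz mono WAV, the server could in principle skip the WebM conversion step for this path; whether it does is a design choice this sketch does not settle.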
Confidence: MEDIUM — Version verified via GitHub issues, React 19 fix confirmed. ONNX Runtime Web dependency means an extra ~5MB WASM download on first load.
2. Audio Format Conversion (Server-Side: WebM → WAV, WAV → OGG)
Package: ffmpeg-static
Version: ^5.2.0 (bundles FFmpeg 6.1.1 binaries for macOS arm64 + x64, Linux, Windows)
Where it lives: server/ — provides the binary path; invoked via Node.js child_process.spawn
Why ffmpeg-static over alternatives:
- fluent-ffmpeg was archived on GitHub in May 2025 and is no longer maintained; do NOT use it as a new dependency
- @ffmpeg-installer/ffmpeg: last updated 2022, stale binary (FFmpeg 4.x)
- ffmpeg-static: actively maintained, ships FFmpeg 6.1.1, macOS arm64 confirmed, installed as an npm dependency (no system-level install needed)
- Direct child_process.spawn("ffmpeg", [...]) with the binary path from ffmpeg-static is the recommended approach for 2025+
Two conversions needed:
a) Incoming STT path: WebM/Opus → WAV 16kHz mono (for Whisper)

    import ffmpegPath from "ffmpeg-static";
    import { spawn } from "node:child_process";

    function webmToWav16k(inputBuffer: Buffer): Promise<Buffer> {
      return new Promise((resolve, reject) => {
        const proc = spawn(ffmpegPath!, [
          "-i", "pipe:0",          // read from stdin
          "-acodec", "pcm_s16le",
          "-ac", "1",              // mono
          "-ar", "16000",          // 16kHz
          "-f", "wav",
          "pipe:1",                // write to stdout
        ]);
        const out: Buffer[] = [];
        proc.stdout.on("data", (c: Buffer) => out.push(c));
        proc.stderr.on("data", () => {});   // drain stderr so ffmpeg's log output cannot block
        proc.on("error", reject);            // spawn failure, e.g. binary missing
        proc.stdin.on("error", () => {});    // ignore EPIPE if ffmpeg exits early; the exit code reports it
        // Settle on "close" so a non-zero exit rejects instead of resolving with partial output
        proc.on("close", (code) =>
          code === 0
            ? resolve(Buffer.concat(out))
            : reject(new Error(`ffmpeg exited with code ${code}`)));
        proc.stdin.end(inputBuffer);
      });
    }

b) Outgoing Telegram TTS path: WAV/PCM → OGG Opus (Telegram voice format)

    function wavToOggOpus(inputBuffer: Buffer): Promise<Buffer> {
      return new Promise((resolve, reject) => {
        const proc = spawn(ffmpegPath!, [
          "-i", "pipe:0",
          "-c:a", "libopus",
          "-b:a", "32k",
          "-f", "ogg",
          "pipe:1",
        ]);
        // ... same pattern as above
      });
    }

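Because both conversions share the same stdin-to-stdout plumbing, the `// ... same pattern as above` placeholder could be filled by one shared helper. The sketch below is one possible shape, not the project's actual code: `pipeThrough` and the two argument constants are hypothetical names, and `-hide_banner` / `-loglevel error` are added so that stderr carries only real errors.

```typescript
import { spawn } from "node:child_process";

// Hypothetical helper: feed `input` to a child process on stdin and collect its stdout.
// Rejects if the process cannot be spawned or exits with a non-zero code.
export function pipeThrough(command: string, args: string[], input: Buffer): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const proc = spawn(command, args, { stdio: ["pipe", "pipe", "pipe"] });
    const out: Buffer[] = [];
    const err: Buffer[] = [];
    proc.stdout.on("data", (chunk: Buffer) => out.push(chunk));
    proc.stderr.on("data", (chunk: Buffer) => err.push(chunk));
    proc.on("error", reject); // e.g. ENOENT when the binary path is wrong
    proc.on("close", (code) => {
      if (code === 0) {
        resolve(Buffer.concat(out));
      } else {
        const tail = Buffer.concat(err).toString().slice(-500); // last part of stderr for context
        reject(new Error(`${command} exited with code ${code}: ${tail}`));
      }
    });
    proc.stdin.on("error", () => {}); // EPIPE if the child exits before reading all input
    proc.stdin.end(input);
  });
}

// Argument lists for the two conversions described above.
export const TO_WAV16K_ARGS = [
  "-hide_banner", "-loglevel", "error",
  "-i", "pipe:0", "-acodec", "pcm_s16le", "-ac", "1", "-ar", "16000", "-f", "wav", "pipe:1",
];
export const WAV_TO_OGG_OPUS_ARGS = [
  "-hide_banner", "-loglevel", "error",
  "-i", "pipe:0", "-c:a", "libopus", "-b:a", "32k", "-f", "ogg", "pipe:1",
];
```

With the `ffmpegPath` import from the earlier snippet, usage would look like `pipeThrough(ffmpegPath!, WAV_TO_OGG_OPUS_ARGS, wavBuffer)`. One caveat worth checking during implementation: when ffmpeg writes WAV to a pipe it cannot seek back to patch the header, so the size fields in the output may be placeholders.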
Confidence: MEDIUM — ffmpeg-static macOS arm64 confirmed via GitHub README. Pipe-based approach is well-documented. fluent-ffmpeg archival confirmed May 2025.
3. Telegram Bridge
Package: grammy
Version: ^1.41.1 (latest, supports Bot API 9.6)
Where it lives: server/ as an optional singleton service — only starts if TELEGRAM_BOT_TOKEN is set
Why grammy over alternatives:
- grammy has 1.4M weekly downloads vs telegraf at 900K; grammY is now the higher-adoption choice
- grammY is TypeScript-first (clean types, no DefinitelyTyped). Telegraf v4 migrated to TS, but its type system is described as "too complex to understand" in grammY's own comparison docs
- node-telegram-bot-api is lower-level with no middleware and requires more boilerplate for this use case
- grammY's file handling API (ctx.getFile()) is the cleanest for the voice relay use case
What the bridge needs to do (thin relay only, per PROJECT.md):

    import { Bot } from "grammy";

    const bot = new Bot(process.env.TELEGRAM_BOT_TOKEN!);

    // Relay text messages to Nexus chat API
    bot.on("message:text", async (ctx) => {
      const response = await relayToNexus(ctx.message.text, ctx.from.id);
      await ctx.reply(response);
    });

    // Receive voice messages: download OGG, transcribe, relay
    bot.on("message:voice", async (ctx) => {
      const file = await ctx.getFile();
      // ctx.getFile() returns metadata only; downloadFile is an app helper that fetches
      // the bytes from the Bot API file URL built from file_path and the bot token
      const oggBuffer = await downloadFile(file.file_path!, bot.token);
      const transcript = await transcribeOgg(oggBuffer); // via smart-whisper
      const response = await relayToNexus(transcript, ctx.from.id);
      await ctx.reply(response);
    });

    // Run with long polling (no webhook needed for single-user local setup)
    bot.start();

Voice message format from Telegram: Telegram sends voice messages as OGG/Opus, 32kbps, mono, 48kHz. To pass this to Whisper (which needs 16kHz WAV), convert with ffmpeg-static pipeline: ogg→wav16k.
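The voice handler above calls a `downloadFile(file_path, token)` helper that this document does not define. A minimal sketch follows, assuming Node 18+ (global `fetch`) and the standard Bot API file URL layout. `telegramFileUrl` is a hypothetical name, and the argument order simply mirrors the call in the handler.

```typescript
// Build the Bot API download URL for a file_path returned by getFile.
export function telegramFileUrl(filePath: string, token: string): string {
  return `https://api.telegram.org/file/bot${token}/${filePath}`;
}

// Hypothetical helper matching the downloadFile(file.file_path!, bot.token) call above.
export async function downloadFile(filePath: string, token: string): Promise<Buffer> {
  const res = await fetch(telegramFileUrl(filePath, token));
  if (!res.ok) {
    // Do not include the URL in the message: it contains the bot token.
    throw new Error(`Telegram file download failed: HTTP ${res.status}`);
  }
  return Buffer.from(await res.arrayBuffer());
}
```

The download link from `getFile` is only guaranteed for a limited time (at least one hour per the Bot API docs), so the sketch downloads immediately rather than storing the URL.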
To send TTS back to Telegram: Convert Piper WAV output → OGG Opus via ffmpeg-static, then use ctx.replyWithVoice(new InputFile(oggBuffer, "voice.ogg")).
Long polling vs webhook: Long polling is correct for this deployment (Mac Mini, local network, no public HTTPS endpoint required). No reverse proxy or SSL cert needed.
Confidence: HIGH — grammy official docs verified at grammy.dev. File download pattern confirmed via grammY file handling guide. Bot API 9.6 support confirmed in homepage badge.
4. Server-Side Piper TTS (Audio Response Endpoint)
No new library needed. The v1.5 STACK.md already specified the child_process.spawn approach with the Piper binary.
What v1.6 adds on top of v1.5:
- A new Express route: POST /api/voice/synthesize that accepts { text, voice? } and returns raw WAV audio (Content-Type: audio/wav)
- This endpoint is used by both the web chat playback (browser <audio> element) and the Telegram bridge (convert WAV → OGG for sendVoice)
- Voice mode flag: requests with voiceMode: true should receive a condensed plain-language response (no markdown, no code blocks); this is a prompt instruction layer, not a library
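As an illustration of the "prompt instruction layer" idea in the last bullet, a sketch might simply append an instruction to the system prompt when the flag is set. The function name and the instruction wording are assumptions made for this example; the actual prompt text is not specified in this document.

```typescript
// Hypothetical instruction text; the real wording is a product decision.
export const VOICE_MODE_INSTRUCTION =
  "Respond in short, plain spoken language. Do not use markdown, lists, or code blocks.";

// Append the voice-mode instruction only when the request asks for it.
export function buildSystemPrompt(basePrompt: string, voiceMode: boolean): string {
  return voiceMode ? `${basePrompt}\n\n${VOICE_MODE_INSTRUCTION}` : basePrompt;
}
```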
Response shape:

    POST /api/voice/synthesize
    Body: { text: string, voice?: "en_US-lessac-medium" }
    Response: audio/wav binary stream

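Since the `voice` value would presumably end up in a model path or command-line argument for the Piper binary, it seems worth validating the body before spawning anything. The sketch below is one way to do that; `parseSynthesizeBody`, `DEFAULT_VOICE`, and the allowed-character pattern are assumptions for illustration only.

```typescript
export interface SynthesizeRequest {
  text: string;
  voice: string;
}

export const DEFAULT_VOICE = "en_US-lessac-medium";

// Assumed naming rule: letters, digits, underscore, hyphen. Rejects path separators and dots.
const VOICE_PATTERN = /^[A-Za-z0-9_-]+$/;

// Validate the JSON body of POST /api/voice/synthesize and apply the default voice.
export function parseSynthesizeBody(body: unknown): SynthesizeRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("body must be a JSON object");
  }
  const { text, voice } = body as { text?: unknown; voice?: unknown };
  if (typeof text !== "string" || text.trim() === "") {
    throw new Error("text must be a non-empty string");
  }
  let voiceName = DEFAULT_VOICE;
  if (voice !== undefined) {
    if (typeof voice !== "string" || !VOICE_PATTERN.test(voice)) {
      throw new Error("voice must be a plain voice name");
    }
    voiceName = voice;
  }
  return { text: text.trim(), voice: voiceName };
}
```

In an Express handler the thrown errors would typically be mapped to a 400 response before any child process is started.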
Confidence: HIGH — this is an implementation pattern, not a new library.
5. Audio Playback (Web Chat)
No new library needed. The browser's native <audio> element handles WAV and OGG playback. The existing TtsButton uses new Audio(url) already. The v1.6 enhancement is:
- Upgrade from new Audio(url) to a proper inline <audio controls> player with an auto-play toggle stored in settings
- Use URL.createObjectURL(blob) to play TTS responses once the WAV body has been received (an object URL wraps a complete Blob, so this is not true streaming)
- Waveform visualization via AnalyserNode from the Web Audio API; no library needed
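For the waveform item, one simple approach is to reduce each `AnalyserNode` frame to a single loudness value and drive the animation from that. The sketch below assumes the byte time-domain format, where 128 represents silence; `rmsLevel` is a hypothetical helper name, and the browser wiring is shown only as comments because it needs a live `AudioContext`.

```typescript
// Compute a 0..1 RMS level from AnalyserNode.getByteTimeDomainData() output.
// Bytes are unsigned with 128 as the zero line, so each sample is re-centered first.
export function rmsLevel(timeDomain: Uint8Array): number {
  if (timeDomain.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < timeDomain.length; i++) {
    const centered = (timeDomain[i] - 128) / 128;
    sumSquares += centered * centered;
  }
  return Math.sqrt(sumSquares / timeDomain.length);
}

// Browser wiring (sketch, not executed here):
//   const analyser = audioContext.createAnalyser();
//   analyser.fftSize = 2048;
//   sourceNode.connect(analyser);
//   const frame = new Uint8Array(analyser.fftSize);
//   const tick = () => {
//     analyser.getByteTimeDomainData(frame);
//     drawBar(rmsLevel(frame));          // drawBar is a hypothetical render function
//     requestAnimationFrame(tick);
//   };
```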
Confidence: HIGH — Web Audio API and <audio> are native browser APIs. No library required.
Installation Summary

    # ui/: add VAD for silence detection + auto-send
    pnpm --filter @paperclipai/ui add @ricky0123/vad-react

    # server/: add FFmpeg binary (for audio format conversion) and Telegram bot
    pnpm --filter @paperclipai/server add ffmpeg-static grammy

    # server/: types (ffmpeg-static ships its own types; grammy is TS-native)
    # No @types/* needed for grammy
    # ffmpeg-static types are included in the package

What NOT to Add
| Avoid | Why | Use Instead |
|---|---|---|
| fluent-ffmpeg | Archived May 2025, no longer maintained | Direct child_process.spawn with ffmpeg-static binary |
| @ffmpeg-installer/ffmpeg | Stale: last updated 2022, ships FFmpeg 4.x | ffmpeg-static ^5.2.0 (ships FFmpeg 6.1.1, arm64 support) |
| telegraf | TypeScript type system described as "too complex to understand" in grammY's comparison docs; lower weekly downloads than grammY | grammy ^1.41.1 |
| node-telegram-bot-api | Low-level, requires callback polling setup, no middleware, more boilerplate | grammy |
| @ricky0123/vad-node | Node.js support was discontinued by the maintainer | @ricky0123/vad-react (browser-only, which is where recording lives) |
| whisper.js / transformers.js (browser WASM) | 200MB+ model download in browser; slow on first load; server-side Whisper via smart-whisper is already in place | smart-whisper on server (already in v1.5 stack) |
| @mintplex-labs/piper-tts-web for server TTS | Browser WASM only, no Node.js support | Piper binary via child_process.spawn (already specified in v1.5) |
| Wake word / real-time streaming audio | Out of scope per PROJECT.md | Future milestone |
Alternatives Considered
| Recommended | Alternative | When to Use Alternative |
|---|---|---|
| grammy ^1.41.1 | telegraf ^4.x | If you need a battle-tested library with a larger plugin ecosystem and can tolerate complex TypeScript types |
| ffmpeg-static + spawn | @ffmpeg/ffmpeg (WASM) | If running in a serverless/edge environment where native binaries are not available; not applicable here |
| @ricky0123/vad-react | Manual AudioWorklet energy threshold | If you need lower latency or don't want the 5MB ONNX WASM payload; simpler but less accurate silence detection |
| @ricky0123/vad-react | MediaRecorder with manual stop button (current impl) | The current v1.3 VoiceRecordButton works; VAD is strictly a UX upgrade |
Version Compatibility
| Package | Compatible With | Notes |
|---|---|---|
| grammy ^1.41.1 | Node.js >=18, TypeScript >=5 | Ships its own types, no @types/grammy |
| ffmpeg-static ^5.2.0 | Node.js >=14, macOS arm64 | Downloads the correct binary at npm install time via its install script |
| @ricky0123/vad-react ^0.0.36 | React 19, Vite 6 | React 19 peer dep fixed in August 2025; requires SharedArrayBuffer (COOP/COEP headers) for ONNX thread worker |
| smart-whisper ^0.8.1 | Node.js >=18, macOS arm64 | From v1.5; verify it's actually installed before v1.6 starts |
Critical COOP/COEP note for @ricky0123/vad-react: The Silero VAD model runs in an ONNX Runtime Web worker that requires SharedArrayBuffer. This means the server must send these headers on HTML responses:

    Cross-Origin-Opener-Policy: same-origin
    Cross-Origin-Embedder-Policy: require-corp

This is a one-line addition to the Express static file middleware. Without it, VAD silently fails in Chrome/Firefox. The existing PWA service worker may also need Cross-Origin-Embedder-Policy: require-corp to avoid breaking.
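The header requirement above could be met with a small middleware registered before the static file handler. The sketch below types the response structurally so that it stands alone without importing Express; `crossOriginIsolation` and `HeaderWritable` are illustrative names, not existing project code.

```typescript
// Minimal structural type: anything with setHeader, such as an Express or node:http response.
interface HeaderWritable {
  setHeader(name: string, value: string): unknown;
}

// Middleware-shaped function that opts every response into cross-origin isolation,
// which browsers require before exposing SharedArrayBuffer.
export function crossOriginIsolation(_req: unknown, res: HeaderWritable, next: () => void): void {
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  next();
}

// Express wiring (sketch; the static directory name is an assumption):
//   app.use(crossOriginIsolation);
//   app.use(express.static("ui/dist"));
```

In the browser, `self.crossOriginIsolated` reports whether the headers took effect. Note that `require-corp` also blocks cross-origin subresources that do not opt in via CORS or a Cross-Origin-Resource-Policy header, so any third-party scripts or images the UI loads would need checking.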
Integration Architecture (v1.6 additions only)

    Browser (UI)                            Server (Express)
    ─────────────────────────────────       ───────────────────────────────────────
    @ricky0123/vad-react                    POST /api/transcribe (existing)
      └── useMicVAD                ─────→     └── ffmpeg-static: webm→wav16k
      └── onSpeechEnd(Float32Array)           └── smart-whisper: wav→text
      └── userSpeaking (waveform UI)          └── returns { text: string }

    React ChatInput (updated)               POST /api/voice/synthesize (new)
      └── voice mode toggle        ─────→     └── Piper binary: text→wav
      └── auto-send on speech end             └── returns audio/wav stream
      └── <audio> inline player    ←──────────────┘

                                            Telegram bridge (new, optional)
                                              └── grammy long polling
                                              └── message:text → relayToNexus()
                                              └── message:voice →
                                                    ffmpeg: ogg→wav16k
                                                    smart-whisper → text
                                                    relayToNexus() → response
                                                    Piper → wav
                                                    ffmpeg: wav→ogg
                                                    ctx.replyWithVoice()

Sources
- grammY official docs: TypeScript support, long polling, file handling confirmed
- grammY GitHub: Bot API 9.6 badge, v1.41.1 version
- grammY file handling guide: ctx.getFile(), download pattern
- grammY comparison with Telegraf: TypeScript type quality comparison
- ffmpeg-static GitHub: macOS arm64 binary confirmed, FFmpeg 6.1.1
- fluent-ffmpeg archival: archived May 22 2025, confirmed
- @ricky0123/vad-react npm: v0.0.36, last published 3 months ago
- vad React 19 support issue #188: fixed August 28 2025, confirmed
- vad API docs: onSpeechEnd Float32Array 16kHz confirmed
- Telegram Bot API sendVoice: OGG Opus format requirement
- nodejs-whisper GitHub: v0.2.9 comparison (rejected: subprocess-based, 10 months stale)
- Piper TTS GitHub releases: macOS aarch64 binary availability
Stack research for: Nexus v1.6 Voice Pipeline + Telegram Bridge
Researched: 2026-04-03
Supersedes: v1.5 STACK.md entries for smart-whisper and Piper; those remain valid, and this file adds the glue and new libraries