nexus/.planning/research/STACK.md
2026-04-04 03:55:49 +00:00

# Technology Stack: v1.6 Voice Pipeline + Telegram Bridge
**Project:** Nexus v1.6 — additive to v1.5 stack (see prior STACK.md for hardware detection, smart-whisper, Puter.js, vectra, openid-client)
**Researched:** 2026-04-03
**Scope:** NEW libraries only for v1.6 — server-side voice pipeline integration, audio format conversion, browser VAD, Telegram bridge
**Confidence:** MEDIUM-HIGH (grammy HIGH via official docs; vad-react MEDIUM — React 19 peer dep confirmed fixed; ffmpeg-static MEDIUM — archived fluent-ffmpeg confirmed, spawn approach verified)
---
## Context: What v1.5 Already Installed
Do not re-add or re-research these — they are in `server/package.json` or `ui/package.json`:
| Package | Location | Purpose |
|---------|----------|---------|
| `smart-whisper ^0.8.1` | `server/` | Whisper.cpp Node bindings (recommended in v1.5 STACK.md) |
| `@mintplex-labs/piper-tts-web ^1.0.4` | `ui/` | Browser-side Piper WASM (already installed) |
| `systeminformation 5` | `server/` | Hardware detection |
| `multer ^2.0.2` | `server/` | Multipart upload (already handles audio blob uploads) |
| `express ^5.1.0` | `server/` | HTTP server |
The existing `VoiceRecordButton` already uses `MediaRecorder` + `POST /api/transcribe`. The existing `usePiperTts` hook already uses `@mintplex-labs/piper-tts-web` for browser-side TTS. The v1.6 work **extends** this — adding silence detection, server-side TTS, and Telegram relay.
---
## New Libraries by Feature Area
### 1. Browser VAD (Silence Detection + Auto-Send)
**Package:** `@ricky0123/vad-react`
**Version:** `^0.0.36`
**Where it lives:** `ui/` only — browser-side ONNX model running off the main thread
**Why:** The existing `VoiceRecordButton` requires the user to manually tap Stop. `@ricky0123/vad-react` uses Silero VAD (ONNX Runtime Web) to detect when the user stops speaking and fires `onSpeechEnd` automatically with the speech segment as a `Float32Array` at 16kHz. This eliminates the manual stop button and enables waveform-while-speaking UI via the `userSpeaking` state flag.
**React 19 compatibility:** Confirmed fixed in v0.0.36 (August 2025). The peer dependency constraint on React 18 was resolved. No `--legacy-peer-deps` needed.
**API surface:**
```typescript
import { useMicVAD } from "@ricky0123/vad-react";

const vad = useMicVAD({
  startOnLoad: false,           // user must explicitly start
  positiveSpeechThreshold: 0.3, // sensitivity
  minSpeechMs: 400,             // ignore sub-400ms blips
  redemptionMs: 1400,           // 1.4s silence = end of utterance
  onSpeechEnd: (audio: Float32Array) => {
    // audio is a 16kHz Float32Array, which matches what Whisper expects.
    // float32ToWav and sendToTranscribeEndpoint are app-level helpers, not part of vad-react.
    sendToTranscribeEndpoint(float32ToWav(audio));
  },
});

// vad.userSpeaking: boolean for waveform animation
// vad.listening: boolean for mic state
// vad.start() / vad.pause()
```
**Key integration note:** `onSpeechEnd` delivers a `Float32Array` at 16000Hz — this maps directly to what `smart-whisper` expects on the server side, so no resampling is needed in the browser-to-server path.
**Confidence: MEDIUM** — Version verified via GitHub issues, React 19 fix confirmed. ONNX Runtime Web dependency means an extra ~5MB WASM download on first load.
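The `float32ToWav` helper referenced in the `onSpeechEnd` callback is not part of `@ricky0123/vad-react`; it has to be written in the app. A minimal sketch follows, assuming mono input and 16-bit PCM output; the function name and the `Uint8Array` return type are illustrative choices, not an existing Nexus API.

```typescript
// Hypothetical helper: encode mono Float32 samples as a 16-bit PCM WAV file.
// Assumes samples are in the range [-1, 1]; out-of-range values are clamped.
function float32ToWav(samples: Float32Array, sampleRate = 16000): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize); // 44-byte canonical WAV header
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);                // RIFF chunk size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                          // fmt chunk size
  view.setUint16(20, 1, true);                           // audio format: PCM
  view.setUint16(22, 1, true);                           // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Scale asymmetrically so -1 maps to -32768 and +1 maps to 32767.
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(buffer);
}
```

In the browser the result can be wrapped as `new Blob([bytes], { type: "audio/wav" })` before being posted as multipart form data to the existing transcribe endpoint.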
---
### 2. Audio Format Conversion (Server-Side: WebM → WAV, WAV → OGG)
**Package:** `ffmpeg-static`
**Version:** `^5.2.0` (bundles FFmpeg 6.1.1 binaries for macOS arm64 + x64, Linux, Windows)
**Where it lives:** `server/` — provides the binary path; invoked via Node.js `child_process.spawn`
**Why `ffmpeg-static` over alternatives:**
- `fluent-ffmpeg` was archived on GitHub May 2025, no longer maintained — do NOT use as a new dependency
- `@ffmpeg-installer/ffmpeg` — last updated 2022, stale binary (FFmpeg 4.x)
- `ffmpeg-static` — actively maintained, ships FFmpeg 6.1.1, macOS arm64 confirmed, installed as an npm dependency (no system-level install needed)
- Direct `child_process.spawn("ffmpeg", [...])` with the binary path from `ffmpeg-static` is the recommended approach for 2025+
**Two conversions needed:**
**a) Incoming STT path: WebM/Opus → WAV 16kHz mono (for Whisper)**
```typescript
import ffmpegPath from "ffmpeg-static";
import { spawn } from "node:child_process";

function webmToWav16k(inputBuffer: Buffer): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const proc = spawn(ffmpegPath!, [
      "-i", "pipe:0",          // read from stdin
      "-acodec", "pcm_s16le",
      "-ac", "1",              // mono
      "-ar", "16000",          // 16kHz
      "-f", "wav",
      "pipe:1",                // write to stdout
    ]);
    const out: Buffer[] = [];
    proc.stdout.on("data", (c: Buffer) => out.push(c));
    proc.stderr.on("data", () => {}); // drain stderr so ffmpeg never blocks on a full pipe
    proc.on("error", reject);
    proc.stdin.on("error", () => {}); // ignore EPIPE if ffmpeg exits before reading all input
    // Settle on "close" so a failed conversion rejects instead of resolving with partial output
    proc.on("close", (code) =>
      code === 0 ? resolve(Buffer.concat(out)) : reject(new Error(`ffmpeg exited with code ${code}`)),
    );
    proc.stdin.end(inputBuffer);
  });
}
```
**b) Outgoing Telegram TTS path: WAV/PCM → OGG Opus (Telegram voice format)**
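```typescript
// Reuses the ffmpegPath and spawn imports from (a)
function wavToOggOpus(inputBuffer: Buffer): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const proc = spawn(ffmpegPath!, [
      "-i", "pipe:0",
      "-c:a", "libopus",
      "-b:a", "32k",
      "-f", "ogg",
      "pipe:1",
    ]);
    // Collect stdout as in webmToWav16k above; reject on a non-zero exit code
    const out: Buffer[] = [];
    proc.stdout.on("data", (c: Buffer) => out.push(c));
    proc.stderr.on("data", () => {});
    proc.on("error", reject);
    proc.stdin.on("error", () => {});
    proc.on("close", (code) =>
      code === 0 ? resolve(Buffer.concat(out)) : reject(new Error(`ffmpeg exited with code ${code}`)),
    );
    proc.stdin.end(inputBuffer);
  });
}
```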
**Confidence: MEDIUM.** `ffmpeg-static` macOS arm64 confirmed via GitHub README. Pipe-based approach is well-documented. fluent-ffmpeg archival confirmed May 2025.
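If the transcriber is fed raw samples rather than a file, the WAV bytes produced by the STT conversion need to be unpacked into a `Float32Array`. The sketch below is one way to do that, under two assumptions that should be checked against real output: the buffer holds 16-bit mono PCM (as requested by `pcm_s16le` and `-ac 1`), and, because FFmpeg cannot seek back on a pipe to patch chunk sizes, the declared `data` size may be a placeholder larger than the buffer. It also skips any metadata chunk that precedes `data`. The function name is hypothetical.

```typescript
// Hypothetical helper: extract 16-bit mono PCM samples from a WAV buffer
// as floats in [-1, 1). Does not validate the fmt chunk; it assumes the
// producer was asked for pcm_s16le mono.
function wavPcm16ToFloat32(wav: Uint8Array): Float32Array {
  const view = new DataView(wav.buffer, wav.byteOffset, wav.byteLength);
  const tag = (o: number): string =>
    String.fromCharCode(
      view.getUint8(o), view.getUint8(o + 1), view.getUint8(o + 2), view.getUint8(o + 3),
    );

  if (wav.byteLength < 12 || tag(0) !== "RIFF" || tag(8) !== "WAVE") {
    throw new Error("not a RIFF/WAVE buffer");
  }

  let offset = 12; // first chunk after the RIFF/WAVE preamble
  while (offset + 8 <= wav.byteLength) {
    const id = tag(offset);
    const declared = view.getUint32(offset + 4, true);
    const start = offset + 8;
    if (id === "data") {
      // Trust the buffer length over a placeholder size written to a pipe.
      const size = Math.min(declared, wav.byteLength - start);
      const count = Math.floor(size / 2);
      const out = new Float32Array(count);
      for (let i = 0; i < count; i++) {
        out[i] = view.getInt16(start + i * 2, true) / 32768;
      }
      return out;
    }
    offset = start + declared + (declared % 2); // chunks are word-aligned
  }
  throw new Error("no data chunk found");
}
```

Because Node's `Buffer` is a `Uint8Array`, the output of `webmToWav16k` can be passed in directly.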
---
### 3. Telegram Bridge
**Package:** `grammy`
**Version:** `^1.41.1` (latest, supports Bot API 9.6)
**Where it lives:** `server/` as an optional singleton service — only starts if `TELEGRAM_BOT_TOKEN` is set
**Why grammy over alternatives:**
- `grammy` has 1.4M weekly downloads vs `telegraf` at 900K — grammY is now the higher-adoption choice
- grammY is TypeScript-first (clean types, no DefinitelyTyped). Telegraf v4 migrated to TS, but grammY's own comparison docs describe its type system as "too complex to understand"
- `node-telegram-bot-api` is lower-level with no middleware, requires more boilerplate for this use case
- grammY's file handling API (`ctx.getFile()`) is the cleanest for the voice relay use case
**What the bridge needs to do (thin relay only — per PROJECT.md):**
```typescript
import { Bot } from "grammy";

const bot = new Bot(process.env.TELEGRAM_BOT_TOKEN!);

// relayToNexus, downloadFile, and transcribeOgg are app-level helpers, not grammY APIs

// Relay text messages to Nexus chat API
bot.on("message:text", async (ctx) => {
  const response = await relayToNexus(ctx.message.text, ctx.from.id);
  await ctx.reply(response);
});

// Receive voice messages: download OGG, transcribe, relay
bot.on("message:voice", async (ctx) => {
  // getFile() returns metadata only; Telegram keeps file_path valid for at least 1 hour
  const file = await ctx.getFile();
  const oggBuffer = await downloadFile(file.file_path!, bot.token);
  const transcript = await transcribeOgg(oggBuffer); // via smart-whisper
  const response = await relayToNexus(transcript, ctx.from.id);
  await ctx.reply(response);
});

// Run with long polling (no webhook needed for single-user local setup)
bot.start();
```
**Voice message format from Telegram:** Telegram sends voice messages as OGG/Opus, 32kbps, mono, 48kHz. To pass this to Whisper (which needs 16kHz WAV), convert with `ffmpeg-static` pipeline: `ogg→wav16k`.
**To send TTS back to Telegram:** Convert Piper WAV output → OGG Opus via `ffmpeg-static`, then use `ctx.replyWithVoice(new InputFile(oggBuffer, "voice.ogg"))`.
**Long polling vs webhook:** Long polling is correct for this deployment (Mac Mini, local network, no public HTTPS endpoint required). No reverse proxy or SSL cert needed.
**Confidence: HIGH** — grammy official docs verified at grammy.dev. File download pattern confirmed via grammY file handling guide. Bot API 9.6 support confirmed in homepage badge.
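The `downloadFile` helper used in the voice handler is left undefined in the relay snippet. One possible shape is sketched here, assuming Node 18+ (global `fetch`) and Telegram's documented file URL layout, `https://api.telegram.org/file/bot<token>/<file_path>`. The `FetchLike` type and the injectable `fetchImpl` parameter are illustrative additions so the function can be exercised without network access.

```typescript
// Minimal response surface this helper depends on (a subset of the Fetch API).
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  arrayBuffer(): Promise<ArrayBuffer>;
}>;

// Build the download URL for a file_path returned by getFile().
function telegramFileUrl(token: string, filePath: string): string {
  return `https://api.telegram.org/file/bot${token}/${filePath}`;
}

// Hypothetical helper: fetch the raw bytes of a Telegram file.
// The error message deliberately omits the URL because it embeds the bot token.
async function downloadFile(
  filePath: string,
  token: string,
  fetchImpl: FetchLike = fetch,
): Promise<Uint8Array> {
  const res = await fetchImpl(telegramFileUrl(token, filePath));
  if (!res.ok) {
    throw new Error(`Telegram file download failed with HTTP ${res.status}`);
  }
  return new Uint8Array(await res.arrayBuffer());
}
```

The returned bytes can be wrapped with `Buffer.from(bytes)` where the FFmpeg helpers expect a `Buffer`.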
---
### 4. Server-Side Piper TTS (Audio Response Endpoint)
**No new library needed.** The v1.5 STACK.md already specified the `child_process.spawn` approach with the Piper binary.
**What v1.6 adds on top of v1.5:**
- A new Express route: `POST /api/voice/synthesize` that accepts `{ text, voice? }` and returns raw WAV audio (`Content-Type: audio/wav`)
- This endpoint is used by both the web chat playback (browser `<audio>` element) and the Telegram bridge (convert WAV → OGG for `sendVoice`)
- Voice mode flag: requests with `voiceMode: true` should receive a condensed plain-language response (no markdown, no code blocks) — this is a prompt instruction layer, not a library
**Response shape:**
```
POST /api/voice/synthesize
Body: { text: string, voice?: "en_US-lessac-medium" }
Response: audio/wav binary stream
```
**Confidence: HIGH** — this is an implementation pattern, not a new library.
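Since the `voice` field presumably ends up selecting a model file for the spawned Piper process, it seems worth validating the request body before it reaches `spawn`. The following is a sketch only: the default voice is taken from the response shape above, while the length cap, the allowed-character rule, and the function name are assumptions rather than project decisions.

```typescript
interface SynthesizeRequest {
  text: string;
  voice: string;
}

const DEFAULT_VOICE = "en_US-lessac-medium"; // assumed default, from the response shape
const MAX_TEXT_LENGTH = 2000;                // hypothetical cap, tune as needed

// Hypothetical validator for the POST /api/voice/synthesize body.
// Throws on invalid input so the route can answer with HTTP 400.
function parseSynthesizeBody(body: unknown): SynthesizeRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("body must be a JSON object");
  }
  const { text, voice } = body as { text?: unknown; voice?: unknown };

  if (typeof text !== "string" || text.trim() === "") {
    throw new Error("text must be a non-empty string");
  }
  if (text.length > MAX_TEXT_LENGTH) {
    throw new Error(`text must be at most ${MAX_TEXT_LENGTH} characters`);
  }

  let voiceName = DEFAULT_VOICE;
  if (voice !== undefined) {
    // Restrict to a plain identifier so the value cannot smuggle path
    // separators or option-like strings into the Piper invocation.
    if (typeof voice !== "string" || !/^[A-Za-z0-9_-]+$/.test(voice)) {
      throw new Error("voice must be a plain model identifier");
    }
    voiceName = voice;
  }
  return { text: text.trim(), voice: voiceName };
}
```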
---
### 5. Audio Playback (Web Chat)
**No new library needed.** The browser's native `<audio>` element handles WAV and OGG playback. The existing `TtsButton` uses `new Audio(url)` already. The v1.6 enhancement is:
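- Upgrade from `new Audio(url)` to a proper inline `<audio controls>` player with auto-play toggle stored in settings
- Use `URL.createObjectURL(blob)` to play TTS responses once the WAV blob has been fetched (this is not true streaming), and release the URL with `URL.revokeObjectURL` when the player unmounts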
- Waveform visualization via `AnalyserNode` from the Web Audio API — no library needed
**Confidence: HIGH** — Web Audio API and `<audio>` are native browser APIs. No library required.
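For the waveform visualization, the `AnalyserNode` side is mostly plumbing; the part that needs a little arithmetic is turning a time-domain frame into a level. A small sketch, assuming the frame comes from `analyser.getByteTimeDomainData(buf)`, where samples are unsigned bytes centred on 128. The function name is illustrative.

```typescript
// Hypothetical helper: root-mean-square level of one AnalyserNode frame.
// Input bytes are unsigned with 128 as the zero line; output is in [0, 1].
function rmsLevel(timeDomain: Uint8Array): number {
  if (timeDomain.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < timeDomain.length; i++) {
    const centered = (timeDomain[i] - 128) / 128; // map 0..255 to roughly -1..1
    sumSquares += centered * centered;
  }
  return Math.sqrt(sumSquares / timeDomain.length);
}
```

In the component, this would typically be called once per animation frame with a reusable `new Uint8Array(analyser.fftSize)` buffer, and the result used to scale the waveform bars.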
---
## Installation Summary
```bash
# ui/ — add VAD for silence detection + auto-send
pnpm --filter @paperclipai/ui add @ricky0123/vad-react
# server/ — add FFmpeg binary (for audio format conversion) and Telegram bot
pnpm --filter @paperclipai/server add ffmpeg-static grammy
# server/ — types (ffmpeg-static ships its own types; grammy is TS-native)
# No @types/* needed for grammy
# ffmpeg-static types are included in the package
```
---
## What NOT to Add
| Avoid | Why | Use Instead |
|-------|-----|-------------|
| `fluent-ffmpeg` | Archived May 2025, no longer maintained | Direct `child_process.spawn` with `ffmpeg-static` binary |
| `@ffmpeg-installer/ffmpeg` | Stale — last updated 2022, ships FFmpeg 4.x | `ffmpeg-static ^5.2.0` (ships FFmpeg 6.1.1, arm64 support) |
| `telegraf` | TypeScript type system "too complex to understand" per maintainers; lower weekly downloads than grammY | `grammy ^1.41.1` |
| `node-telegram-bot-api` | Low-level, requires callback polling setup, no middleware, more boilerplate | `grammy` |
| `@ricky0123/vad-node` | Node.js support was discontinued by the maintainer; wound down | `@ricky0123/vad-react` (browser-only, which is where recording lives) |
| `whisper.js` / `transformers.js` (browser WASM) | 200MB+ model download in browser; slow on first load; server-side Whisper via `smart-whisper` is already in place | `smart-whisper` on server (already in v1.5 stack) |
| `@mintplex-labs/piper-tts-web` for server TTS | Browser WASM only, no Node.js support | Piper binary via `child_process.spawn` (already specified in v1.5) |
| Wake word / real-time streaming audio | Out of scope per PROJECT.md | Future milestone |
---
## Alternatives Considered
| Recommended | Alternative | When to Use Alternative |
|-------------|-------------|-------------------------|
| `grammy ^1.41.1` | `telegraf ^4.x` | If you need a battle-tested library with larger plugin ecosystem and tolerate complex TypeScript types |
| `ffmpeg-static` + `spawn` | `@ffmpeg/ffmpeg` (WASM) | If running in a serverless/edge environment where native binaries are not available — not applicable here |
| `@ricky0123/vad-react` | Manual `AudioWorklet` energy threshold | If you need lower latency or don't want the 5MB ONNX WASM payload; simpler but less accurate silence detection |
| `@ricky0123/vad-react` | `MediaRecorder` with manual stop button (current impl) | The current v1.3 VoiceRecordButton works; VAD is strictly a UX upgrade |
---
## Version Compatibility
| Package | Compatible With | Notes |
|---------|-----------------|-------|
| `grammy ^1.41.1` | Node.js >=18, TypeScript >=5 | Ships its own types, no `@types/grammy` |
| `ffmpeg-static ^5.2.0` | Node.js >=14, macOS arm64 | Downloads the platform binary at install time via the package's install script; if pnpm is configured to block dependency build scripts, allow it for `ffmpeg-static` |
| `@ricky0123/vad-react ^0.0.36` | React 19, Vite 6 | React 19 peer dep fixed in August 2025; requires SharedArrayBuffer (COOP/COEP headers) for ONNX thread worker |
| `smart-whisper ^0.8.1` | Node.js >=18, macOS arm64 | From v1.5 — verify it's actually installed before v1.6 starts |
**Critical COOP/COEP note for `@ricky0123/vad-react`:** The Silero VAD model runs in an ONNX Runtime Web worker that requires `SharedArrayBuffer`. This means the server must send these headers on HTML responses:
```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```
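This is a small addition to the Express static file middleware (for example via its `setHeaders` option). Without it, `SharedArrayBuffer` is unavailable and VAD may fail silently in Chrome/Firefox. Note that `require-corp` also blocks cross-origin subresources that do not opt in via CORS or a `Cross-Origin-Resource-Policy` header, so any script or asset loaded from a CDN should be re-tested. The existing PWA service worker may also need `Cross-Origin-Embedder-Policy: require-corp` to avoid breaking.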
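As a sketch of the header change, the function below sets both headers on any response object that exposes `setHeader`. It is written against a minimal interface (an assumption made so it can be tested in isolation) and is shaped to be passed as the `setHeaders` option of `express.static`.

```typescript
// Minimal surface needed from a response object; Node's ServerResponse
// and Express's Response both satisfy it.
interface HeaderSink {
  setHeader(name: string, value: string): unknown;
}

// Hypothetical helper: opt the document into cross-origin isolation so
// SharedArrayBuffer is available to the ONNX Runtime Web worker.
function setCrossOriginIsolationHeaders(res: HeaderSink): void {
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
}
```

Wiring it up would look like `app.use(express.static(uiDistDir, { setHeaders: setCrossOriginIsolationHeaders }))`, where `uiDistDir` is a placeholder for the built UI directory. In the browser, `window.crossOriginIsolated` reports whether the headers took effect.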
---
## Integration Architecture (v1.6 additions only)
```
Browser (UI)                              Server (Express)
─────────────────────────────────         ───────────────────────────────────────
@ricky0123/vad-react                      POST /api/transcribe (existing)
  └── useMicVAD                  ─────→     └── ffmpeg-static: webm→wav16k
      └── onSpeechEnd(Float32Array)         └── smart-whisper: wav→text
      └── userSpeaking (waveform UI)        └── returns { text: string }
React ChatInput (updated)                 POST /api/voice/synthesize (new)
  └── voice mode toggle          ─────→     └── Piper binary: text→wav
  └── auto-send on speech end               └── returns audio/wav stream
  └── <audio> inline player  ←──────────────┘
                                          Telegram bridge (new, optional)
                                            └── grammy long polling
                                            └── message:text → relayToNexus()
                                            └── message:voice →
                                                  ffmpeg: ogg→wav16k
                                                  smart-whisper → text
                                                  relayToNexus() → response
                                                  Piper → wav
                                                  ffmpeg: wav→ogg
                                                  ctx.replyWithVoice()
```
---
## Sources
- [grammY official docs](https://grammy.dev/) — TypeScript support, long polling, file handling confirmed
- [grammY GitHub](https://github.com/grammyjs/grammY) — Bot API 9.6 badge, v1.41.1 version
- [grammY file handling guide](https://grammy.dev/guide/files) — `ctx.getFile()`, download pattern
- [grammY comparison with Telegraf](https://grammy.dev/resources/comparison) — TypeScript type quality comparison
- [ffmpeg-static GitHub](https://github.com/eugeneware/ffmpeg-static) — macOS arm64 binary confirmed, FFmpeg 6.1.1
- [fluent-ffmpeg archival](https://github.com/fluent-ffmpeg/node-fluent-ffmpeg) — archived May 22 2025, confirmed
- [@ricky0123/vad-react npm](https://www.npmjs.com/package/@ricky0123/vad-react) — v0.0.36, last published 3 months ago
- [vad React 19 support issue #188](https://github.com/ricky0123/vad/issues/188) — fixed August 28 2025, confirmed
- [vad API docs](https://docs.vad.ricky0123.com/user-guide/api/) — `onSpeechEnd` Float32Array 16kHz confirmed
- [Telegram Bot API sendVoice](https://core.telegram.org/bots/api#sendvoice) — OGG Opus format requirement
- [nodejs-whisper GitHub](https://github.com/ChetanXpro/nodejs-whisper) — v0.2.9 comparison (rejected: subprocess-based, 10 months stale)
- [Piper TTS GitHub releases](https://github.com/rhasspy/piper/releases) — macOS aarch64 binary availability
---
*Stack research for: Nexus v1.6 Voice Pipeline + Telegram Bridge*
*Researched: 2026-04-03*
*Supersedes: v1.5 STACK.md entries for smart-whisper and Piper — those remain valid; this file adds the glue and new libraries*