nexus/.planning/research/STACK.md

# Technology Stack: v1.6 Voice Pipeline + Telegram Bridge

**Project:** Nexus v1.6 — additive to v1.5 stack (see prior STACK.md for hardware detection, smart-whisper, Puter.js, vectra, openid-client)
**Researched:** 2026-04-03
**Scope:** NEW libraries only for v1.6 — server-side voice pipeline integration, audio format conversion, browser VAD, Telegram bridge
**Confidence:** MEDIUM-HIGH (grammy HIGH via official docs; vad-react MEDIUM — React 19 peer dep confirmed fixed; ffmpeg-static MEDIUM — archived fluent-ffmpeg confirmed, spawn approach verified)

---

## Context: What v1.5 Already Installed

Do not re-add or re-research these — they are in `server/package.json` or `ui/package.json`:

| Package | Location | Purpose |
|---------|----------|---------|
| `smart-whisper ^0.8.1` | `server/` | Whisper.cpp Node bindings (recommended in v1.5 STACK.md) |
| `@mintplex-labs/piper-tts-web ^1.0.4` | `ui/` | Browser-side Piper WASM (already installed) |
| `systeminformation 5` | `server/` | Hardware detection |
| `multer ^2.0.2` | `server/` | Multipart upload (already handles audio blob uploads) |
| `express ^5.1.0` | `server/` | HTTP server |

The existing `VoiceRecordButton` already uses `MediaRecorder` + `POST /api/transcribe`. The existing `usePiperTts` hook already uses `@mintplex-labs/piper-tts-web` for browser-side TTS. The v1.6 work **extends** this — adding silence detection, server-side TTS, and Telegram relay.

---

## New Libraries by Feature Area

### 1. Browser VAD (Silence Detection + Auto-Send)

**Package:** `@ricky0123/vad-react`
**Version:** `^0.0.36`
**Where it lives:** `ui/` only — browser-side ONNX model running off the main thread

**Why:** The existing `VoiceRecordButton` requires the user to manually tap Stop. `@ricky0123/vad-react` uses Silero VAD (ONNX Runtime Web) to detect when the user stops speaking and fires `onSpeechEnd` automatically with the speech segment as a `Float32Array` at 16kHz. This eliminates the manual stop button and enables waveform-while-speaking UI via the `userSpeaking` state flag.

**React 19 compatibility:** Confirmed fixed in v0.0.36 (August 2025). The peer dependency constraint on React 18 was resolved. No `--legacy-peer-deps` needed.

**API surface:**

```typescript
import { useMicVAD } from "@ricky0123/vad-react";

const vad = useMicVAD({
  startOnLoad: false,              // user must explicitly start
  positiveSpeechThreshold: 0.3,   // sensitivity
  minSpeechMs: 400,               // ignore sub-400ms blips
  redemptionMs: 1400,             // 1.4s silence = end of utterance
  onSpeechEnd: (audio: Float32Array) => {
    // audio is 16kHz Float32Array — matches what Whisper expects
    sendToTranscribeEndpoint(float32ToWav(audio));
  },
});

// vad.userSpeaking — boolean for waveform animation
// vad.listening    — boolean for mic state
// vad.start() / vad.pause()
```

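**Key integration note:** `onSpeechEnd` delivers a `Float32Array` sampled at 16000Hz, the same rate `smart-whisper` expects on the server side, so no resampling is needed in the browser-to-server path. The samples still need a container for upload; in the snippet above, `float32ToWav` and `sendToTranscribeEndpoint` are placeholders for app code that wraps the samples as WAV and posts them to `/api/transcribe`.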
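The `float32ToWav` call in the snippet above is not defined anywhere in this document. A minimal sketch of what it could look like follows, assuming mono input and 16-bit PCM output; the function body is illustrative and not taken from the Nexus codebase or from `@ricky0123/vad-react`.

```typescript
// Hypothetical helper: the name matches the call in the VAD snippet,
// but this body is an illustrative sketch, not existing Nexus code.
function float32ToWav(samples: Float32Array, sampleRate = 16000): Uint8Array {
  const bytesPerSample = 2; // 16-bit PCM
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);                // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                          // fmt chunk size for PCM
  view.setUint16(20, 1, true);                           // audio format 1 = PCM
  view.setUint16(22, 1, true);                           // channels: mono (assumed)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true);              // block align
  view.setUint16(34, 16, true);                          // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] before scaling so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Uint8Array(buffer);
}
```

In the browser the returned bytes would typically be wrapped in a `Blob` with type `audio/wav` and appended to a `FormData` upload, which the existing `multer` handling can accept like any other audio blob.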

**Confidence: MEDIUM** — Version verified via GitHub issues, React 19 fix confirmed. ONNX Runtime Web dependency means an extra ~5MB WASM download on first load.

---

### 2. Audio Format Conversion (Server-Side: WebM → WAV, WAV → OGG)

**Package:** `ffmpeg-static`
**Version:** `^5.2.0` (bundles FFmpeg 6.1.1 binaries for macOS arm64 + x64, Linux, Windows)
**Where it lives:** `server/` — provides the binary path; invoked via Node.js `child_process.spawn`

**Why `ffmpeg-static` over alternatives:**
- `fluent-ffmpeg` was archived on GitHub May 2025, no longer maintained — do NOT use as a new dependency
- `@ffmpeg-installer/ffmpeg` — last updated 2022, stale binary (FFmpeg 4.x)
- `ffmpeg-static` — actively maintained, ships FFmpeg 6.1.1, macOS arm64 confirmed, installed as an npm dependency (no system-level install needed)
- Calling `child_process.spawn(ffmpegPath, [...])` directly, with `ffmpegPath` imported from `ffmpeg-static`, is the recommended approach for 2025+

**Two conversions needed:**

**a) Incoming STT path: WebM/Opus → WAV 16kHz mono (for Whisper)**

```typescript
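import ffmpegPath from "ffmpeg-static";
import { spawn } from "node:child_process";

function webmToWav16k(inputBuffer: Buffer): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    if (!ffmpegPath) {
      return reject(new Error("ffmpeg-static has no binary for this platform"));
    }
    const proc = spawn(ffmpegPath, [
      "-hide_banner", "-loglevel", "error", // keep stderr to real errors only
      "-i", "pipe:0",           // read from stdin
      "-acodec", "pcm_s16le",
      "-ac", "1",               // mono
      "-ar", "16000",           // 16kHz
      "-f", "wav",
      "pipe:1",                 // write to stdout
    ]);
    const out: Buffer[] = [];
    const err: Buffer[] = [];
    proc.stdout.on("data", (c: Buffer) => out.push(c));
    proc.stderr.on("data", (c: Buffer) => err.push(c));
    proc.on("error", reject);   // binary missing or not executable
    // Resolve only on a clean exit; otherwise surface ffmpeg's own message.
    proc.on("close", (code) =>
      code === 0
        ? resolve(Buffer.concat(out))
        : reject(new Error(`ffmpeg exited with code ${code}: ${Buffer.concat(err).toString()}`)),
    );
    proc.stdin.on("error", () => {}); // EPIPE if ffmpeg exits early; "close" reports the failure
    proc.stdin.end(inputBuffer);
  });
}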
```
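The WAV bytes produced above still have to be turned into samples before transcription. The sketch below assumes the transcription call takes 16 kHz mono samples as a `Float32Array`; verify that against the installed `smart-whisper` version. `wavPcm16ToFloat32` is a hypothetical helper, not part of any library named here. It walks the RIFF chunks rather than assuming a fixed 44-byte header, and it ignores the declared data size when it exceeds the buffer, because ffmpeg cannot seek back to patch the header when writing to a pipe.

```typescript
// Hypothetical helper, not part of smart-whisper or ffmpeg-static.
// Accepts a Node Buffer as well, since Buffer is a Uint8Array.
function wavPcm16ToFloat32(wav: Uint8Array): Float32Array {
  const view = new DataView(wav.buffer, wav.byteOffset, wav.byteLength);
  const tag = (o: number): string =>
    String.fromCharCode(
      view.getUint8(o), view.getUint8(o + 1), view.getUint8(o + 2), view.getUint8(o + 3),
    );

  if (wav.byteLength < 12 || tag(0) !== "RIFF" || tag(8) !== "WAVE") {
    throw new Error("not a RIFF/WAVE buffer");
  }

  let offset = 12;
  while (offset + 8 <= wav.byteLength) {
    const id = tag(offset);
    const declared = view.getUint32(offset + 4, true);
    const start = offset + 8;

    if (id === "fmt ") {
      const format = view.getUint16(start, true);
      const channels = view.getUint16(start + 2, true);
      const bits = view.getUint16(start + 14, true);
      if (format !== 1 || channels !== 1 || bits !== 16) {
        throw new Error("expected 16-bit mono PCM");
      }
    }

    if (id === "data") {
      // On piped output the declared size may be a placeholder; trust the buffer length.
      const size = Math.min(declared, wav.byteLength - start);
      const count = Math.floor(size / 2);
      const out = new Float32Array(count);
      for (let i = 0; i < count; i++) {
        out[i] = view.getInt16(start + i * 2, true) / 32768; // scale to [-1, 1)
      }
      return out;
    }

    offset = start + declared + (declared % 2); // chunks are word-aligned
  }
  throw new Error("no data chunk found");
}
```

An alternative that avoids header parsing entirely is to ask ffmpeg for raw samples (`-f f32le`) instead of WAV; the WAV route is kept here because it matches the conversion shown above.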

**b) Outgoing Telegram TTS path: WAV/PCM → OGG Opus (Telegram voice format)**

```typescript
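function wavToOggOpus(inputBuffer: Buffer): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const proc = spawn(ffmpegPath!, [
      "-i", "pipe:0",
      "-c:a", "libopus",
      "-b:a", "32k",
      "-f", "ogg",
      "pipe:1",
    ]);
    // Same stdin/stdout pattern as webmToWav16k above
    const out: Buffer[] = [];
    proc.stdout.on("data", (c: Buffer) => out.push(c));
    proc.stderr.on("data", () => {}); // drain stderr so ffmpeg never blocks on it
    proc.on("error", reject);
    proc.on("close", (code) =>
      code === 0
        ? resolve(Buffer.concat(out))
        : reject(new Error(`ffmpeg exited with code ${code}`)),
    );
    proc.stdin.on("error", () => {}); // ignore EPIPE; "close" reports the failure
    proc.stdin.end(inputBuffer);
  });
}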
```

**Confidence: MEDIUM** — `ffmpeg-static` macOS arm64 confirmed via GitHub README. Pipe-based approach is well-documented. fluent-ffmpeg archival confirmed May 2025.

---

### 3. Telegram Bridge

**Package:** `grammy`
**Version:** `^1.41.1` (latest, supports Bot API 9.6)
**Where it lives:** `server/` as an optional singleton service — only starts if `TELEGRAM_BOT_TOKEN` is set

**Why grammy over alternatives:**
- `grammy` has 1.4M weekly downloads vs `telegraf` at 900K — grammY is now the higher-adoption choice
- grammY is TypeScript-first (it ships its own types, so no DefinitelyTyped package is needed). Telegraf v4 also migrated to TypeScript, but grammY's own comparison docs describe its type system as "too complex to understand"
- `node-telegram-bot-api` is lower-level with no middleware, requires more boilerplate for this use case
- grammY's file handling API (`ctx.getFile()`) is the cleanest for the voice relay use case

**What the bridge needs to do (thin relay only — per PROJECT.md):**

```typescript
import { Bot } from "grammy";

const bot = new Bot(process.env.TELEGRAM_BOT_TOKEN!);

// Relay text messages to Nexus chat API
bot.on("message:text", async (ctx) => {
  const response = await relayToNexus(ctx.message.text, ctx.from.id);
  await ctx.reply(response);
});

// Receive voice messages — download OGG, transcribe, relay
bot.on("message:voice", async (ctx) => {
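  const file = await ctx.getFile();
  // Core grammY returns only file metadata here; file.download() requires the @grammyjs/files plugin.
  // downloadFile is a local helper that fetches file_path from the Bot API file endpoint.
  const oggBuffer = await downloadFile(file.file_path!, bot.token);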
  const transcript = await transcribeOgg(oggBuffer); // via smart-whisper
  const response = await relayToNexus(transcript, ctx.from.id);
  await ctx.reply(response);
});

// Run with long polling (no webhook needed for single-user local setup)
bot.start();
```
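The `downloadFile` helper used in the voice handler is left undefined above. One way to write it without extra plugins is to fetch the path returned by `getFile` from the Bot API file endpoint. This is a sketch assuming Node.js 18+ (global `fetch`); `telegramFileUrl` is a hypothetical name introduced here for illustration.

```typescript
// Builds the Bot API download URL: https://api.telegram.org/file/bot<token>/<file_path>
function telegramFileUrl(token: string, filePath: string): string {
  return `https://api.telegram.org/file/bot${token}/${filePath}`;
}

// Hypothetical helper matching the call in the handler above.
async function downloadFile(filePath: string, token: string): Promise<Buffer> {
  const res = await fetch(telegramFileUrl(token, filePath));
  if (!res.ok) {
    // Report only the status: the URL contains the bot token and should not be logged.
    throw new Error(`Telegram file download failed: HTTP ${res.status}`);
  }
  return Buffer.from(await res.arrayBuffer());
}
```

The Bot API limits bot downloads to 20 MB per file, which is far above the size of a typical voice message, so buffering the whole response in memory is reasonable for this relay.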

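**Voice message format from Telegram:** Telegram sends voice messages as OGG/Opus, 32kbps, mono, 48kHz. Whisper needs 16kHz WAV, so convert with the `ffmpeg-static` pipeline (`ogg→wav16k`). ffmpeg detects the input container from the stream itself, so the arguments used for the WebM conversion apply unchanged to OGG input.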

**To send TTS back to Telegram:** Convert Piper WAV output → OGG Opus via `ffmpeg-static`, then use `ctx.replyWithVoice(new InputFile(oggBuffer, "voice.ogg"))`.

**Long polling vs webhook:** Long polling is correct for this deployment (Mac Mini, local network, no public HTTPS endpoint required). No reverse proxy or SSL cert needed.
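Since the bridge is optional and long polling runs for the lifetime of the process, startup can be kept to a small guard. The sketch below is an assumption about how that wiring might look: `startTelegramBridge` and `StartableBot` are hypothetical names, and the bot factory is injected so the guard can be exercised without a real token. In real code the factory would construct a grammY `Bot` and register the handlers shown earlier.

```typescript
// Minimal shape the guard needs from a bot; grammY's Bot has a compatible start().
interface StartableBot {
  start(): Promise<void>;
}

// Hypothetical wiring helper, not existing Nexus code.
function startTelegramBridge(
  env: Record<string, string | undefined>,
  createBot: (token: string) => StartableBot,
  onError: (err: unknown) => void = (err) => console.error("telegram bridge stopped:", err),
): StartableBot | null {
  const token = env.TELEGRAM_BOT_TOKEN?.trim();
  if (!token) return null; // bridge is optional: no token, no bot

  const bot = createBot(token);
  // With long polling, start() keeps running until the bot is stopped,
  // so it is not awaited here; failures are routed to onError instead.
  bot.start().catch(onError);
  return bot;
}
```

Returning the bot (or `null`) lets the server keep a single reference for graceful shutdown while the rest of startup proceeds unchanged when no token is configured.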

**Confidence: HIGH** — grammy official docs verified at grammy.dev. File download pattern confirmed via grammY file handling guide. Bot API 9.6 support confirmed in homepage badge.

---

### 4. Server-Side Piper TTS (Audio Response Endpoint)

**No new library needed.** The v1.5 STACK.md already specified the `child_process.spawn` approach with the Piper binary.

**What v1.6 adds on top of v1.5:**
- A new Express route: `POST /api/voice/synthesize` that accepts `{ text, voice? }` and returns raw WAV audio (`Content-Type: audio/wav`)
- This endpoint is used by both the web chat playback (browser `<audio>` element) and the Telegram bridge (convert WAV → OGG for `sendVoice`)
- Voice mode flag: requests with `voiceMode: true` should receive a condensed plain-language response (no markdown, no code blocks) — this is a prompt instruction layer, not a library

**Response shape:**

```
POST /api/voice/synthesize
Body: { text: string, voice?: "en_US-lessac-medium" }
Response: audio/wav binary stream
```
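Because this endpoint passes user-supplied text and a voice name on to a child process, it is worth validating the body before spawning anything. The following is a sketch only: `parseSynthesizeRequest`, the 2000-character cap, and the allowed character set for voice names are all assumptions made for illustration, not requirements stated by Piper or by this document.

```typescript
interface SynthesizeRequest {
  text: string;
  voice: string;
}

const DEFAULT_VOICE = "en_US-lessac-medium"; // the only voice named in this document
const MAX_TEXT_LENGTH = 2000;                // assumed cap, not a Piper limit

// Hypothetical validator for the POST /api/voice/synthesize body.
function parseSynthesizeRequest(body: unknown): SynthesizeRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("body must be a JSON object");
  }
  const { text, voice } = body as { text?: unknown; voice?: unknown };

  if (typeof text !== "string" || text.trim() === "") {
    throw new Error("text must be a non-empty string");
  }
  if (text.length > MAX_TEXT_LENGTH) {
    throw new Error(`text exceeds ${MAX_TEXT_LENGTH} characters`);
  }

  let voiceName = DEFAULT_VOICE;
  if (voice !== undefined) {
    // If the voice name is later used to locate a model file, a strict
    // character set keeps path separators and ".." out of that path.
    if (typeof voice !== "string" || !/^[A-Za-z0-9_-]+$/.test(voice)) {
      throw new Error("voice must match [A-Za-z0-9_-]+");
    }
    voiceName = voice;
  }
  return { text: text.trim(), voice: voiceName };
}
```

A route handler would call this first and map a thrown error to a 400 response, so the Piper process is only started for well-formed requests.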

**Confidence: HIGH** — this is an implementation pattern, not a new library.

---

### 5. Audio Playback (Web Chat)

**No new library needed.** The browser's native `<audio>` element handles WAV and OGG playback. The existing `TtsButton` uses `new Audio(url)` already. The v1.6 enhancement is:

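- Upgrade from a bare `new Audio(url)` call to a proper inline `<audio controls>` player, with an auto-play toggle stored in settings
- Use `URL.createObjectURL(blob)` to play a fetched TTS response. The blob must be fully downloaded first, so this is buffered rather than streaming playback; release the URL with `URL.revokeObjectURL` once the player is done with it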
- Waveform visualization via `AnalyserNode` from the Web Audio API — no library needed
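
As an illustration of the `AnalyserNode` approach in the last item, the sketch below reduces one frame of time-domain data to a single level suitable for a meter or bar animation. `TimeDomainSource`, `rmsLevel`, and `readLevel` are hypothetical names; the interface is a structural subset of `AnalyserNode` so the code type-checks outside a browser.

```typescript
// Structural subset of the Web Audio AnalyserNode used by this sketch.
interface TimeDomainSource {
  readonly fftSize: number;
  getByteTimeDomainData(array: Uint8Array): void;
}

// Byte time-domain samples are unsigned and centred on 128.
// Returns the RMS level of the frame, in the range [0, 1].
function rmsLevel(bytes: Uint8Array): number {
  if (bytes.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < bytes.length; i++) {
    const centred = (bytes[i] - 128) / 128;
    sumSquares += centred * centred;
  }
  return Math.sqrt(sumSquares / bytes.length);
}

// Reads one frame from the analyser into a scratch buffer and returns its level.
// Pass a reusable scratch array to avoid allocating on every animation frame.
function readLevel(
  source: TimeDomainSource,
  scratch: Uint8Array = new Uint8Array(source.fftSize),
): number {
  source.getByteTimeDomainData(scratch);
  return rmsLevel(scratch);
}
```

In the browser, the source would be an analyser created with `audioContext.createAnalyser()` and connected to the microphone or playback node, with `readLevel` called from a `requestAnimationFrame` loop.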

**Confidence: HIGH** — Web Audio API and `<audio>` are native browser APIs. No library required.

---

## Installation Summary

```bash
# ui/ — add VAD for silence detection + auto-send
pnpm --filter @paperclipai/ui add @ricky0123/vad-react

# server/ — add FFmpeg binary (for audio format conversion) and Telegram bot
pnpm --filter @paperclipai/server add ffmpeg-static grammy

# server/ — types (ffmpeg-static ships its own types; grammy is TS-native)
# No @types/* needed for grammy
# ffmpeg-static types are included in the package
```

---

## What NOT to Add

| Avoid | Why | Use Instead |
|-------|-----|-------------|
| `fluent-ffmpeg` | Archived May 2025, no longer maintained | Direct `child_process.spawn` with `ffmpeg-static` binary |
| `@ffmpeg-installer/ffmpeg` | Stale — last updated 2022, ships FFmpeg 4.x | `ffmpeg-static ^5.2.0` (ships FFmpeg 6.1.1, arm64 support) |
| `telegraf` | Type system described as "too complex to understand" in grammY's comparison docs; lower weekly downloads than grammY | `grammy ^1.41.1` |
| `node-telegram-bot-api` | Low-level, requires callback polling setup, no middleware, more boilerplate | `grammy` |
| `@ricky0123/vad-node` | Discontinued by the maintainer; Node.js support is no longer developed | `@ricky0123/vad-react` (browser-only, which is where recording lives) |
| `whisper.js` / `transformers.js` (browser WASM) | 200MB+ model download in browser; slow on first load; server-side Whisper via `smart-whisper` is already in place | `smart-whisper` on server (already in v1.5 stack) |
| `@mintplex-labs/piper-tts-web` for server TTS | Browser WASM only, no Node.js support | Piper binary via `child_process.spawn` (already specified in v1.5) |
| Wake word / real-time streaming audio | Out of scope per PROJECT.md | Future milestone |

---

## Alternatives Considered

| Recommended | Alternative | When to Use Alternative |
|-------------|-------------|-------------------------|
| `grammy ^1.41.1` | `telegraf ^4.x` | If you need a battle-tested library with a larger plugin ecosystem and can tolerate complex TypeScript types |
| `ffmpeg-static` + `spawn` | `@ffmpeg/ffmpeg` (WASM) | If running in a serverless/edge environment where native binaries are not available — not applicable here |
| `@ricky0123/vad-react` | Manual `AudioWorklet` energy threshold | If you need lower latency or don't want the 5MB ONNX WASM payload; simpler but less accurate silence detection |
| `@ricky0123/vad-react` | `MediaRecorder` with manual stop button (current impl) | The current v1.3 VoiceRecordButton works; VAD is strictly a UX upgrade |

---

## Version Compatibility

| Package | Compatible With | Notes |
|---------|-----------------|-------|
| `grammy ^1.41.1` | Node.js >=18, TypeScript >=5 | Ships its own types, no `@types/grammy` |
| `ffmpeg-static ^5.2.0` | Node.js >=14, macOS arm64 | Downloads the platform binary in its install script at `npm install` time, so dependency install scripts must be allowed to run |
| `@ricky0123/vad-react ^0.0.36` | React 19, Vite 6 | React 19 peer dep fixed in August 2025; requires SharedArrayBuffer (COOP/COEP headers) for ONNX thread worker |
| `smart-whisper ^0.8.1` | Node.js >=18, macOS arm64 | From v1.5 — verify it's actually installed before v1.6 starts |

**Critical COOP/COEP note for `@ricky0123/vad-react`:** The Silero VAD model runs in an ONNX Runtime Web worker that requires `SharedArrayBuffer`. This means the server must send these headers on HTML responses:

```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```

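This is a small addition to the Express static file middleware. Without the headers, `SharedArrayBuffer` is unavailable in Chrome and Firefox, and VAD may fail without a visible error. Note that `require-corp` also blocks cross-origin subresources that do not opt in via CORS or a `Cross-Origin-Resource-Policy` header, so check any third-party scripts (for example Puter.js, if it is loaded from a CDN) before enabling it. The existing PWA service worker may also need `Cross-Origin-Embedder-Policy: require-corp` to avoid breaking.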
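
A minimal sketch of that middleware follows. `crossOriginIsolation` is a hypothetical name, and the structural types stand in for Express's `Response` and `NextFunction` so the snippet has no dependency on Express typings; it is assumed to be registered before the static file handler.

```typescript
// Minimal structural types; Express's Response and NextFunction are compatible with these shapes.
interface HeaderWriter {
  setHeader(name: string, value: string): unknown;
}
type Next = () => void;

// Express-style middleware that opts every response into cross-origin isolation.
function crossOriginIsolation(_req: unknown, res: HeaderWriter, next: Next): void {
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  next();
}
```

Usage would look like `app.use(crossOriginIsolation)` ahead of `express.static(...)`. In the browser, `self.crossOriginIsolated` reports whether the headers took effect.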

---

## Integration Architecture (v1.6 additions only)

```
Browser (UI)                              Server (Express)
─────────────────────────────────         ───────────────────────────────────────

@ricky0123/vad-react                      POST /api/transcribe (existing)
  └── useMicVAD                    ─────→   └── ffmpeg-static: webm→wav16k
  └── onSpeechEnd(Float32Array)              └── smart-whisper: wav→text
  └── userSpeaking (waveform UI)             └── returns { text: string }

React ChatInput (updated)                 POST /api/voice/synthesize (new)
  └── voice mode toggle            ─────→   └── Piper binary: text→wav
  └── auto-send on speech end               └── returns audio/wav stream
  └── <audio> inline player ←──────────────┘

                                          Telegram bridge (new, optional)
                                            └── grammy long polling
                                            └── message:text → relayToNexus()
                                            └── message:voice →
                                                  ffmpeg: ogg→wav16k
                                                  smart-whisper → text
                                                  relayToNexus() → response
                                                  Piper → wav
                                                  ffmpeg: wav→ogg
                                                  ctx.replyWithVoice()
```
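
The Telegram voice branch of the diagram can be sketched as a single function with each step injected. Every name below (`VoicePipelineDeps`, `handleTelegramVoice`, and the step names) is hypothetical and mirrors the arrows in the diagram rather than existing Nexus code. Injecting the steps keeps the ordering testable without ffmpeg, Whisper, or Piper installed.

```typescript
// Each step corresponds to one line of the Telegram voice branch in the diagram.
interface VoicePipelineDeps {
  oggToWav16k(ogg: Uint8Array): Promise<Uint8Array>;           // ffmpeg: ogg→wav16k
  transcribe(wav: Uint8Array): Promise<string>;                // smart-whisper → text
  relayToNexus(text: string, userId: number): Promise<string>; // relayToNexus() → response
  synthesize(text: string): Promise<Uint8Array>;               // Piper → wav
  wavToOgg(wav: Uint8Array): Promise<Uint8Array>;              // ffmpeg: wav→ogg
}

interface VoiceReply {
  transcript: string;
  responseText: string;
  oggVoice: Uint8Array; // ready to hand to ctx.replyWithVoice()
}

async function handleTelegramVoice(
  ogg: Uint8Array,
  userId: number,
  deps: VoicePipelineDeps,
): Promise<VoiceReply> {
  const wav = await deps.oggToWav16k(ogg);
  const transcript = (await deps.transcribe(wav)).trim();
  // Assumed policy: an empty transcript is treated as an error rather than relayed.
  if (transcript === "") throw new Error("empty transcript");

  const responseText = await deps.relayToNexus(transcript, userId);
  const replyWav = await deps.synthesize(responseText);
  const oggVoice = await deps.wavToOgg(replyWav);
  return { transcript, responseText, oggVoice };
}
```

Returning the transcript and response text alongside the audio leaves room for the handler to fall back to a plain text reply if synthesis or encoding fails.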

---

## Sources

- [grammY official docs](https://grammy.dev/) — TypeScript support, long polling, file handling confirmed
- [grammY GitHub](https://github.com/grammyjs/grammY) — Bot API 9.6 badge, v1.41.1 version
- [grammY file handling guide](https://grammy.dev/guide/files) — `ctx.getFile()`, download pattern
- [grammY comparison with Telegraf](https://grammy.dev/resources/comparison) — TypeScript type quality comparison
- [ffmpeg-static GitHub](https://github.com/eugeneware/ffmpeg-static) — macOS arm64 binary confirmed, FFmpeg 6.1.1
- [fluent-ffmpeg archival](https://github.com/fluent-ffmpeg/node-fluent-ffmpeg) — archived May 22 2025, confirmed
- [@ricky0123/vad-react npm](https://www.npmjs.com/package/@ricky0123/vad-react) — v0.0.36, last published 3 months ago
- [vad React 19 support issue #188](https://github.com/ricky0123/vad/issues/188) — fixed August 28 2025, confirmed
- [vad API docs](https://docs.vad.ricky0123.com/user-guide/api/) — `onSpeechEnd` Float32Array 16kHz confirmed
- [Telegram Bot API sendVoice](https://core.telegram.org/bots/api#sendvoice) — OGG Opus format requirement
- [nodejs-whisper GitHub](https://github.com/ChetanXpro/nodejs-whisper) — v0.2.9 comparison (rejected: subprocess-based, 10 months stale)
- [Piper TTS GitHub releases](https://github.com/rhasspy/piper/releases) — macOS aarch64 binary availability

---

*Stack research for: Nexus v1.6 Voice Pipeline + Telegram Bridge*
*Researched: 2026-04-03*
*Extends (does not replace): the v1.5 STACK.md entries for smart-whisper and Piper remain valid; this file adds the glue and new libraries*