Nexus Dev 0c29013931 docs: complete project research

2026-04-03 23:53:14 +00:00


Technology Stack: v1.6 Voice Pipeline + Telegram Bridge

Project: Nexus v1.6, additive to v1.5 stack (see prior STACK.md for hardware detection, smart-whisper, Puter.js, vectra, openid-client)
Researched: 2026-04-03
Scope: NEW libraries only for v1.6: server-side voice pipeline integration, audio format conversion, browser VAD, Telegram bridge
Confidence: MEDIUM-HIGH (grammy HIGH via official docs; vad-react MEDIUM, React 19 peer dep confirmed fixed; ffmpeg-static MEDIUM, archived fluent-ffmpeg confirmed, spawn approach verified)

Context: What v1.5 Already Installed

Do not re-add or re-research these — they are in server/package.json or ui/package.json:

Package	Location	Purpose
`smart-whisper ^0.8.1`	`server/`	Whisper.cpp Node bindings (recommended in v1.5 STACK.md)
`@mintplex-labs/piper-tts-web ^1.0.4`	`ui/`	Browser-side Piper WASM (already installed)
`systeminformation 5`	`server/`	Hardware detection
`multer ^2.0.2`	`server/`	Multipart upload (already handles audio blob uploads)
`express ^5.1.0`	`server/`	HTTP server

The existing VoiceRecordButton already uses MediaRecorder + POST /api/transcribe. The existing usePiperTts hook already uses @mintplex-labs/piper-tts-web for browser-side TTS. The v1.6 work extends this — adding silence detection, server-side TTS, and Telegram relay.

New Libraries by Feature Area

1. Browser VAD (Silence Detection + Auto-Send)

Package: @ricky0123/vad-react
Version: ^0.0.36
Where it lives: ui/ only (browser-side ONNX model running off the main thread)

Why: The existing VoiceRecordButton requires the user to manually tap Stop. @ricky0123/vad-react uses Silero VAD (ONNX Runtime Web) to detect when the user stops speaking and fires onSpeechEnd automatically with the speech segment as a Float32Array at 16kHz. This eliminates the manual stop button and enables waveform-while-speaking UI via the userSpeaking state flag.

React 19 compatibility: Confirmed fixed in v0.0.36 (August 2025). The peer dependency constraint on React 18 was resolved. No --legacy-peer-deps needed.

API surface:

import { useMicVAD } from "@ricky0123/vad-react";

const vad = useMicVAD({
  startOnLoad: false,              // user must explicitly start
  positiveSpeechThreshold: 0.3,   // sensitivity
  minSpeechMs: 400,               // ignore sub-400ms blips
  redemptionMs: 1400,             // 1.4s silence = end of utterance
  onSpeechEnd: (audio: Float32Array) => {
    // audio is 16kHz Float32Array, matching what Whisper expects
    // float32ToWav and sendToTranscribeEndpoint are app-level helpers, not vad-react exports
    sendToTranscribeEndpoint(float32ToWav(audio));
  },
});

// vad.userSpeaking — boolean for waveform animation
// vad.listening    — boolean for mic state
// vad.start() / vad.pause()

Key integration note: onSpeechEnd delivers a Float32Array at 16000Hz — this maps directly to what smart-whisper expects on the server side, so no resampling is needed in the browser-to-server path.
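The API sketch above calls float32ToWav without defining it. A minimal version, assuming the endpoint accepts 16-bit PCM mono WAV, could look like the following. The function name comes from the snippet above; the ArrayBuffer return type and the default sample rate argument are assumptions for illustration.

```typescript
// Hypothetical helper: wraps mono Float32 samples in a 44-byte-header, 16-bit PCM WAV container.
function float32ToWav(samples: Float32Array, sampleRate = 16000): ArrayBuffer {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);                 // RIFF chunk size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                           // fmt chunk size
  view.setUint16(20, 1, true);                            // audio format: PCM
  view.setUint16(22, 1, true);                            // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true);  // byte rate
  view.setUint16(32, bytesPerSample, true);               // block align
  view.setUint16(34, 16, true);                           // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

In the browser the result can be wrapped as new Blob([buffer], { type: "audio/wav" }) before it is appended to the multipart upload.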

Confidence: MEDIUM — Version verified via GitHub issues, React 19 fix confirmed. ONNX Runtime Web dependency means an extra ~5MB WASM download on first load.

2. Audio Format Conversion (Server-Side: WebM → WAV, WAV → OGG)

Package: ffmpeg-static
Version: ^5.2.0 (bundles FFmpeg 6.1.1 binaries for macOS arm64 + x64, Linux, Windows)
Where it lives: server/; provides the binary path, invoked via Node.js child_process.spawn

Why ffmpeg-static over alternatives:

fluent-ffmpeg was archived on GitHub May 2025, no longer maintained — do NOT use as a new dependency
@ffmpeg-installer/ffmpeg — last updated 2022, stale binary (FFmpeg 4.x)
ffmpeg-static — actively maintained, ships FFmpeg 6.1.1, macOS arm64 confirmed, installed as an npm dependency (no system-level install needed)
Spawning the binary directly with child_process.spawn(ffmpegPath, [...]), where ffmpegPath is the path exported by ffmpeg-static, is the recommended approach for 2025+

Two conversions needed:

a) Incoming STT path: WebM/Opus → WAV 16kHz mono (for Whisper)

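import ffmpegPath from "ffmpeg-static";
import { spawn } from "node:child_process";

function webmToWav16k(inputBuffer: Buffer): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    if (!ffmpegPath) return reject(new Error("ffmpeg-static has no binary for this platform"));
    const proc = spawn(ffmpegPath, [
      "-hide_banner", "-loglevel", "error",  // keep stderr limited to real errors
      "-i", "pipe:0",           // read from stdin
      "-acodec", "pcm_s16le",
      "-ac", "1",               // mono
      "-ar", "16000",           // 16kHz
      "-f", "wav",
      "pipe:1",                 // write to stdout
    ]);
    const out: Buffer[] = [];
    const err: Buffer[] = [];
    proc.stdout.on("data", (c: Buffer) => out.push(c));
    proc.stderr.on("data", (c: Buffer) => err.push(c));
    proc.on("error", reject);   // spawn failure (missing or non-executable binary)
    proc.on("close", (code) => {
      // Resolve only on a clean exit; otherwise surface ffmpeg's own error text
      if (code === 0) resolve(Buffer.concat(out));
      else reject(new Error(`ffmpeg exited with code ${code}: ${Buffer.concat(err).toString()}`));
    });
    proc.stdin.on("error", () => {});  // EPIPE if ffmpeg exits early; reported via "close"
    proc.stdin.end(inputBuffer);
  });
}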
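Since smart-whisper is described above as taking a 16kHz Float32Array, the WAV buffer produced by webmToWav16k still has to be unpacked into samples. The sketch below is one way to do that, assuming 16-bit little-endian mono PCM (the format requested from ffmpeg above). The name wavPcm16ToFloat32 is hypothetical. It walks the RIFF chunks rather than assuming a fixed 44-byte header, because encoders may insert extra chunks before the audio, and it clamps the declared data size because a piped encoder may not be able to seek back and patch the length field.

```typescript
// Hypothetical helper: extracts 16-bit LE mono PCM from a WAV buffer as Float32 in [-1, 1).
// It does not validate the fmt chunk; 16-bit mono input is an assumption.
function wavPcm16ToFloat32(wav: Uint8Array): Float32Array {
  const view = new DataView(wav.buffer, wav.byteOffset, wav.byteLength);
  const tag = (o: number): string =>
    String.fromCharCode(wav[o], wav[o + 1], wav[o + 2], wav[o + 3]);

  if (wav.byteLength < 12 || tag(0) !== "RIFF" || tag(8) !== "WAVE") {
    throw new Error("not a RIFF/WAVE buffer");
  }

  let offset = 12; // first chunk starts right after "RIFF" <size> "WAVE"
  while (offset + 8 <= wav.byteLength) {
    const id = tag(offset);
    const declared = view.getUint32(offset + 4, true);
    const start = offset + 8;
    if (id === "data") {
      // Clamp: the declared size can be a placeholder when the writer could not seek.
      const size = Math.min(declared, wav.byteLength - start);
      const count = Math.floor(size / 2);
      const samples = new Float32Array(count);
      for (let i = 0; i < count; i++) {
        samples[i] = view.getInt16(start + i * 2, true) / 32768;
      }
      return samples;
    }
    offset = start + declared + (declared % 2); // chunks are padded to even length
  }
  throw new Error("no data chunk found");
}
```

A Node.js Buffer is a Uint8Array, so the output of webmToWav16k can be passed in directly.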

b) Outgoing Telegram TTS path: WAV/PCM → OGG Opus (Telegram voice format)

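function wavToOggOpus(inputBuffer: Buffer): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const proc = spawn(ffmpegPath!, [
      "-i", "pipe:0",
      "-c:a", "libopus",
      "-b:a", "32k",
      "-f", "ogg",
      "pipe:1",
    ]);
    // Same stdin/stdout pattern as webmToWav16k
    const out: Buffer[] = [];
    proc.stdout.on("data", (c: Buffer) => out.push(c));
    proc.stderr.on("data", () => {});  // drain stderr so ffmpeg cannot block on it
    proc.on("error", reject);
    proc.on("close", (code) =>
      code === 0 ? resolve(Buffer.concat(out)) : reject(new Error(`ffmpeg exited with code ${code}`)));
    proc.stdin.on("error", () => {});  // EPIPE is reported via the close handler
    proc.stdin.end(inputBuffer);
  });
}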

Confidence: MEDIUM — ffmpeg-static macOS arm64 confirmed via GitHub README. Pipe-based approach is well-documented. fluent-ffmpeg archival confirmed May 2025.

3. Telegram Bridge

Package: grammy
Version: ^1.41.1 (latest, supports Bot API 9.6)
Where it lives: server/ as an optional singleton service that only starts if TELEGRAM_BOT_TOKEN is set

Why grammy over alternatives:

grammY has 1.4M weekly downloads vs Telegraf at 900K, so grammY is now the higher-adoption choice
grammY is TypeScript-first (clean types, no DefinitelyTyped). Telegraf v4 migrated to TS, but grammY's own comparison docs describe its type system as "too complex to understand"
node-telegram-bot-api is lower-level with no middleware, requires more boilerplate for this use case
grammY's file handling API (ctx.getFile()) is the cleanest for the voice relay use case

What the bridge needs to do (thin relay only — per PROJECT.md):

import { Bot } from "grammy";

const bot = new Bot(process.env.TELEGRAM_BOT_TOKEN!);

// Relay text messages to Nexus chat API
bot.on("message:text", async (ctx) => {
  const response = await relayToNexus(ctx.message.text, ctx.from.id);
  await ctx.reply(response);
});

// Receive voice messages — download OGG, transcribe, relay
bot.on("message:voice", async (ctx) => {
  const file = await ctx.getFile();
  // downloadFile is a project helper (not a grammY API); Telegram guarantees the
  // file link for at least 1 hour after getFile()
  const oggBuffer = await downloadFile(file.file_path!, bot.token);
  const transcript = await transcribeOgg(oggBuffer); // via smart-whisper
  const response = await relayToNexus(transcript, ctx.from.id);
  await ctx.reply(response);
});

// Run with long polling (no webhook needed for single-user local setup)
bot.start();

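Voice message format from Telegram: Telegram delivers voice messages as OGG/Opus (typically mono, 48kHz; the bitrate varies by client). Whisper needs 16kHz mono input, so convert with the ffmpeg-static pipeline (ogg→wav16k). The webmToWav16k arguments work unchanged here because ffmpeg detects the input container from the stream.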
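The voice handler above calls a downloadFile helper that the snippet leaves undefined. A minimal sketch follows, assuming Node.js 18+ with a global fetch. The URL layout is the one documented for the Bot API getFile method; telegramFileUrl is a hypothetical name, and returning a Uint8Array instead of a Buffer is a choice made here to keep the sketch portable.

```typescript
// Hypothetical helper: builds the download URL for a file_path returned by getFile.
function telegramFileUrl(token: string, filePath: string): string {
  return `https://api.telegram.org/file/bot${token}/${filePath}`;
}

// Argument order matches the call in the handler: downloadFile(file.file_path!, bot.token).
async function downloadFile(filePath: string, token: string): Promise<Uint8Array> {
  const res = await fetch(telegramFileUrl(token, filePath));
  if (!res.ok) {
    // Avoid logging the URL itself: it embeds the bot token.
    throw new Error(`Telegram file download failed: HTTP ${res.status}`);
  }
  return new Uint8Array(await res.arrayBuffer());
}
```

On the server, Buffer.from(bytes) turns the result into the Buffer that the ffmpeg helpers expect.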

To send TTS back to Telegram: Convert Piper WAV output → OGG Opus via ffmpeg-static, then use ctx.replyWithVoice(new InputFile(oggBuffer, "voice.ogg")).

Long polling vs webhook: Long polling is correct for this deployment (Mac Mini, local network, no public HTTPS endpoint required). No reverse proxy or SSL cert needed.

Confidence: HIGH — grammy official docs verified at grammy.dev. File download pattern confirmed via grammY file handling guide. Bot API 9.6 support confirmed in homepage badge.

4. Server-Side Piper TTS (Audio Response Endpoint)

No new library needed. The v1.5 STACK.md already specified the child_process.spawn approach with the Piper binary.

What v1.6 adds on top of v1.5:

A new Express route: POST /api/voice/synthesize that accepts { text, voice? } and returns raw WAV audio (Content-Type: audio/wav)
This endpoint is used by both the web chat playback (browser <audio> element) and the Telegram bridge (convert WAV → OGG for sendVoice)
Voice mode flag: requests with voiceMode: true should receive a condensed plain-language response (no markdown, no code blocks) — this is a prompt instruction layer, not a library

Response shape:

POST /api/voice/synthesize
Body: { text: string, voice?: "en_US-lessac-medium" }
Response: audio/wav binary stream
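Because the voice value will most likely end up in a model path or a spawn argument for the Piper binary, the request body is worth validating before use. The sketch below shows one way to do that for the { text, voice? } shape above. The function name, the 2000-character limit, and the allowed-character pattern are assumptions for illustration, not part of the spec.

```typescript
interface SynthesizeRequest {
  text: string;
  voice: string;
}

const DEFAULT_VOICE = "en_US-lessac-medium"; // the example voice from the response shape
const MAX_TEXT_LENGTH = 2000;                // assumed limit, tune per deployment
const VOICE_PATTERN = /^[A-Za-z0-9_-]+$/;    // rejects slashes and dots, so no path traversal

// Hypothetical validator for the JSON body of POST /api/voice/synthesize.
function parseSynthesizeBody(body: unknown): SynthesizeRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("body must be a JSON object");
  }
  const { text, voice } = body as { text?: unknown; voice?: unknown };

  if (typeof text !== "string" || text.trim() === "") {
    throw new Error("text must be a non-empty string");
  }
  if (text.length > MAX_TEXT_LENGTH) {
    throw new Error(`text must be at most ${MAX_TEXT_LENGTH} characters`);
  }

  let voiceName = DEFAULT_VOICE;
  if (voice !== undefined) {
    if (typeof voice !== "string" || !VOICE_PATTERN.test(voice)) {
      throw new Error("voice must be a plain model name");
    }
    voiceName = voice;
  }
  return { text: text.trim(), voice: voiceName };
}
```

An Express handler could call this on req.body and respond with HTTP 400 when it throws.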

Confidence: HIGH — this is an implementation pattern, not a new library.

5. Audio Playback (Web Chat)

No new library needed. The browser's native <audio> element handles WAV and OGG playback. The existing TtsButton uses new Audio(url) already. The v1.6 enhancement is:

Upgrade from new Audio(url) to a proper inline <audio controls> player with auto-play toggle stored in settings
Use URL.createObjectURL(blob) to play TTS responses once the WAV blob has downloaded, and call URL.revokeObjectURL when playback ends
Waveform visualization via AnalyserNode from the Web Audio API — no library needed
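For the AnalyserNode item above, the per-frame work is mostly a matter of turning the analyser's byte buffer into a level that can drive the animation. A small sketch of that step follows, assuming the data comes from AnalyserNode.getByteTimeDomainData, which reports unsigned bytes centered on 128. The function name is hypothetical.

```typescript
// Hypothetical helper: RMS level in [0, 1] from AnalyserNode time-domain bytes.
function levelFromTimeDomain(data: Uint8Array): number {
  if (data.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < data.length; i++) {
    const centered = (data[i] - 128) / 128; // 128 represents silence
    sumSquares += centered * centered;
  }
  return Math.sqrt(sumSquares / data.length);
}

// Browser usage (not executed here; audioContext and source are assumed to exist):
//   const analyser = audioContext.createAnalyser();
//   source.connect(analyser);
//   const frame = new Uint8Array(analyser.fftSize);
//   analyser.getByteTimeDomainData(frame);
//   const level = levelFromTimeDomain(frame); // feed into the waveform bar height
```

Calling this inside requestAnimationFrame keeps the visualization in step with the display refresh.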

Confidence: HIGH — Web Audio API and <audio> are native browser APIs. No library required.

Installation Summary

# ui/ — add VAD for silence detection + auto-send
pnpm --filter @paperclipai/ui add @ricky0123/vad-react

# server/ — add FFmpeg binary (for audio format conversion) and Telegram bot
pnpm --filter @paperclipai/server add ffmpeg-static grammy

# server/ — types (ffmpeg-static ships its own types; grammy is TS-native)
# No @types/* needed for grammy
# ffmpeg-static types are included in the package

What NOT to Add

Avoid	Why	Use Instead
`fluent-ffmpeg`	Archived May 2025, no longer maintained	Direct `child_process.spawn` with `ffmpeg-static` binary
`@ffmpeg-installer/ffmpeg`	Stale — last updated 2022, ships FFmpeg 4.x	`ffmpeg-static ^5.2.0` (ships FFmpeg 6.1.1, arm64 support)
`telegraf`	Types described as "too complex to understand" in grammY's comparison docs; fewer weekly downloads than grammY	`grammy ^1.41.1`
`node-telegram-bot-api`	Low-level, requires callback polling setup, no middleware, more boilerplate	`grammy`
`@ricky0123/vad-node`	The maintainer discontinued the Node.js package	`@ricky0123/vad-react` (browser-only, which is where recording lives)
`whisper.js` / `transformers.js` (browser WASM)	200MB+ model download in browser; slow on first load; server-side Whisper via `smart-whisper` is already in place	`smart-whisper` on server (already in v1.5 stack)
`@mintplex-labs/piper-tts-web` for server TTS	Browser WASM only, no Node.js support	Piper binary via `child_process.spawn` (already specified in v1.5)
Wake word / real-time streaming audio	Out of scope per PROJECT.md	Future milestone

Alternatives Considered

Recommended	Alternative	When to Use Alternative
`grammy ^1.41.1`	`telegraf ^4.x`	If you need a battle-tested library with larger plugin ecosystem and tolerate complex TypeScript types
`ffmpeg-static` + `spawn`	`@ffmpeg/ffmpeg` (WASM)	If running in a serverless/edge environment where native binaries are not available — not applicable here
`@ricky0123/vad-react`	Manual `AudioWorklet` energy threshold	If you need lower latency or don't want the 5MB ONNX WASM payload; simpler but less accurate silence detection
`@ricky0123/vad-react`	`MediaRecorder` with manual stop button (current impl)	The current v1.3 VoiceRecordButton works; VAD is strictly a UX upgrade

Version Compatibility

Package	Compatible With	Notes
`grammy ^1.41.1`	Node.js >=18, TypeScript >=5	Ships its own types, no `@types/grammy`
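`ffmpeg-static ^5.2.0`	Node.js >=14, macOS arm64	Downloads the platform binary at install time via its install script, not `optionalDependencies`; pnpm 10+ skips dependency build scripts unless they are approved, so confirm the binary exists after install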
`@ricky0123/vad-react ^0.0.36`	React 19, Vite 6	React 19 peer dep fixed in August 2025; requires SharedArrayBuffer (COOP/COEP headers) for ONNX thread worker
`smart-whisper ^0.8.1`	Node.js >=18, macOS arm64	From v1.5 — verify it's actually installed before v1.6 starts

Critical COOP/COEP note for @ricky0123/vad-react: The Silero VAD model runs in an ONNX Runtime Web worker that requires SharedArrayBuffer. This means the server must send these headers on HTML responses:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

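This is a small middleware registered ahead of the Express static file middleware. Without these headers the page is not cross-origin isolated, SharedArrayBuffer is unavailable in Chrome and Firefox, and VAD can fail without a visible error. Because require-corp blocks cross-origin subresources that do not opt in via CORS or Cross-Origin-Resource-Policy, check third-party scripts and the responses served by the existing PWA service worker after enabling it.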
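The two headers can be set in a single middleware function. The sketch below types the response structurally, so it needs no Express import and works with any object exposing setHeader, which the Express response does. The name crossOriginIsolation is hypothetical.

```typescript
// Minimal structural type: anything with setHeader, including an Express Response.
interface HeaderWritable {
  setHeader(name: string, value: string): unknown;
}

// Hypothetical middleware: opts every response into cross-origin isolation.
function crossOriginIsolation(_req: unknown, res: HeaderWritable, next: () => void): void {
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  next();
}

// Registration (assumes an existing Express app and static directory):
//   app.use(crossOriginIsolation);
//   app.use(express.static(uiDistPath));
```

Registering it before express.static matters, since middleware added afterwards never runs for requests the static handler has already answered.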

Integration Architecture (v1.6 additions only)

Browser (UI)                              Server (Express)
─────────────────────────────────         ───────────────────────────────────────

@ricky0123/vad-react                      POST /api/transcribe (existing)
  └── useMicVAD                    ─────→   └── ffmpeg-static: webm→wav16k
  └── onSpeechEnd(Float32Array)              └── smart-whisper: wav→text
  └── userSpeaking (waveform UI)             └── returns { text: string }

React ChatInput (updated)                 POST /api/voice/synthesize (new)
  └── voice mode toggle            ─────→   └── Piper binary: text→wav
  └── auto-send on speech end               └── returns audio/wav stream
  └── <audio> inline player ←──────────────┘

                                          Telegram bridge (new, optional)
                                            └── grammy long polling
                                            └── message:text → relayToNexus()
                                            └── message:voice →
                                                  ffmpeg: ogg→wav16k
                                                  smart-whisper → text
                                                  relayToNexus() → response
                                                  Piper → wav
                                                  ffmpeg: wav→ogg
                                                  ctx.replyWithVoice()

Sources

grammY official docs — TypeScript support, long polling, file handling confirmed
grammY GitHub — Bot API 9.6 badge, v1.41.1 version
grammY file handling guide — ctx.getFile(), download pattern
grammY comparison with Telegraf — TypeScript type quality comparison
ffmpeg-static GitHub — macOS arm64 binary confirmed, FFmpeg 6.1.1
fluent-ffmpeg archival — archived May 22 2025, confirmed
@ricky0123/vad-react npm — v0.0.36, last published 3 months ago
vad React 19 support issue #188 — fixed August 28 2025, confirmed
vad API docs — onSpeechEnd Float32Array 16kHz confirmed
Telegram Bot API sendVoice — OGG Opus format requirement
nodejs-whisper GitHub — v0.2.9 comparison (rejected: subprocess-based, 10 months stale)
Piper TTS GitHub releases — macOS aarch64 binary availability

Stack research for: Nexus v1.6 Voice Pipeline + Telegram Bridge
Researched: 2026-04-03
Supersedes: v1.5 STACK.md entries for smart-whisper and Piper. Those remain valid; this file adds the glue and new libraries
