docs(36): research voice pipeline foundation
parent 0dfd0cbac5
commit 0736541a91
1 changed file with 607 additions and 0 deletions
.planning/phases/36-voice-pipeline-foundation/36-RESEARCH.md

# Phase 36: Voice Pipeline Foundation - Research

**Researched:** 2026-04-03
**Domain:** Server-side STT/TTS voice pipeline: ffmpeg transcoding, VoicePipelineService abstraction, dual output formatting, voiceMode flag propagation, nexus-settings schema extension
**Confidence:** HIGH

---

<user_constraints>
## User Constraints (from CONTEXT.md)

### Locked Decisions
All implementation choices are at Claude's discretion; the discuss phase was skipped per user setting. Use the ROADMAP phase goal, success criteria, and codebase conventions to guide decisions.

Key research findings to incorporate:
- VoicePipelineService as server-side service: `transcribe(buffer, format)`, `synthesize(text, voiceId?)`, `formatForVoice(text)`
- Move `/transcribe` from `chat-files.ts` to new `voice.ts` route to reduce rebase conflict surface
- Use `ffmpeg-static ^5.2.0` (NOT archived fluent-ffmpeg) for WebM to WAV and OGG to WAV transcoding
- Use `execFile` (not `exec`) for CLI subprocess calls, which prevents shell injection
- Wrap CLI calls (`piper`, `ffmpeg`) in `Promise.race([call, timeout(8000)])` for graceful degradation
- Voice mode flag must survive: client to Express to message persistence to agent session codec
- Dual output: prompt engineering requests `SPOKEN: [prose]` plus `DETAILED: [markdown]` with post-processing strip as fallback
- nexus-settings schema extension: `voiceMode: "text" | "voice_input" | "full_voice"`, optional `telegramToken`
- No DB migrations: all state lives in existing JSONB fields and file-backed JSON

### Claude's Discretion
All implementation choices are at Claude's discretion; the discuss phase was skipped.

### Deferred Ideas (OUT OF SCOPE)
None; the discuss phase was skipped.
</user_constraints>

---

<phase_requirements>
## Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| VPIPE-01 | User's voice input is transcribed via local Whisper STT with automatic language detection | Existing whisper-cpp/openai-whisper cascade in chat-files.ts; move to VoicePipelineService.transcribe(); add `-l auto` flag for language detection |
| VPIPE-02 | Agent text responses are synthesized to speech via local Piper TTS in under 3 seconds | Piper via execFile with sentence chunking; warmup call on startup; 8s timeout via Promise.race |
| VPIPE-03 | Voice pipeline accepts audio from any transport via a shared VoicePipelineService | Service abstraction in server/src/services/voice-pipeline.ts; consumed by voice.ts routes and future Telegram service |
| VPIPE-04 | Audio from any source transcoded to WAV 16kHz mono via ffmpeg before Whisper processing | ffmpeg-static ^5.3.0; spawn with `-ar 16000 -ac 1`; pipe buffer through ffmpeg stdin/stdout |
| VPIPE-05 | Voice mode flag on messages triggers voice-optimized response formatting | Add `voiceMode` field to createMessageSchema Zod validator; pass through chat.ts stream endpoint; inject as system prompt instruction |
| VPIPE-06 | Every voice interaction produces dual output: spoken prose + full text with code blocks | Dual output via SPOKEN/DETAILED prompt template injected when voiceMode equals full_voice; post-process strip as fallback |
</phase_requirements>

---

## Summary

Phase 36 builds the transport-agnostic voice pipeline that all subsequent phases (37: Web Chat Voice UI, 38: Telegram Bridge) depend on. The work is purely server-side with zero UI changes. Three deliverables gate everything downstream: (1) `VoicePipelineService` in `server/src/services/voice-pipeline.ts` with `transcribe()`, `synthesize()`, and `formatForVoice()` methods; (2) `server/src/routes/voice.ts` with `POST /api/transcribe` (moved from `chat-files.ts`) and `POST /api/synthesize`; (3) the voiceMode flag wired from `createMessageSchema` through the stream endpoint to the AI prompt.

The codebase already has a working Whisper cascade in `chat-files.ts` (lines 316–386) that handles the transcription pattern. The main work is to:

- extract that logic into `VoicePipelineService`;
- add ffmpeg transcoding before Whisper, which is currently missing (the route writes raw WebM to disk and passes it directly to whisper-cpp without explicit format conversion);
- add a synthesize method that calls Piper;
- extend the `nexus-settings` schema with `voiceMode` and `telegramToken`;
- propagate `voiceMode` through the message pipeline.

**Primary recommendation:** Build VoicePipelineService first, then the voice.ts routes, then the schema changes in the shared package, then wire voiceMode through the chat.ts stream endpoint. This ordering makes each deliverable independently testable before the next is built.

---

## Standard Stack

### Core

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| ffmpeg-static | ^5.3.0 | Ships FFmpeg 6.1.1 binaries for macOS arm64/Linux; no system ffmpeg required | fluent-ffmpeg was archived May 2025; ffmpeg-static plus a direct `spawn` call is the maintained path |
| smart-whisper | ^0.8.1 | Node bindings for whisper.cpp with Apple Silicon acceleration | Already in codebase (used in chat-files cascade fallback); avoid re-implementing |
| zod | ^3.24.2 | Schema validation for settings extension and message validators | Already used throughout; do not introduce a second validator library |
| vitest | ^3.0.5 | Unit tests for VoicePipelineService and voice routes | Already the test framework (`vitest.config.ts` in server root) |

### Supporting

| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| multer | ^2.0.2 | Multipart audio upload handling | Already used in chat-files.ts; reuse the same pattern in voice.ts |
| node:child_process | built-in | `execFile` wrapper for piper and ffmpeg CLI calls | Use `promisify(execFile)`, the same pattern as git-file-service.ts and the existing transcribe route |

### Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| ffmpeg-static | fluent-ffmpeg | fluent-ffmpeg was archived May 2025; do not use |
| ffmpeg-static | @ffmpeg-installer/ffmpeg | Ships FFmpeg 4.x, not 6.x, so codec support is older |
| execFile subprocess | ffmpeg npm bindings | No mature maintained binding; CLI approach is industry standard |

**Installation:**
```bash
cd /opt/nexus/server && pnpm add ffmpeg-static
cd /opt/nexus/server && pnpm add -D @types/ffmpeg-static
```

**Version verification:** `npm view ffmpeg-static version` returns `5.3.0` (verified 2026-04-03). Use `^5.3.0` as the lower bound (newer than the `^5.2.0` in the CONTEXT.md decisions, but backwards compatible).

---

## Architecture Patterns

### Recommended Project Structure

```
server/src/
├── services/
│   └── voice-pipeline.ts      # NEW: VoicePipelineService (transcribe, synthesize, formatForVoice)
├── routes/
│   ├── voice.ts               # NEW: POST /api/transcribe, POST /api/synthesize
│   ├── chat-files.ts          # MODIFIED: remove POST /transcribe block (lines 297-386)
│   └── chat.ts                # MODIFIED: pass voiceMode through stream endpoint

packages/shared/src/
├── validators/
│   └── chat.ts                # MODIFIED: add voiceMode to createMessageSchema
└── types/
    └── chat.ts                # MODIFIED: add voiceMode to ChatMessage interface

server/src/services/
└── nexus-settings.ts          # MODIFIED: add voiceMode + telegramToken fields
```

### Pattern 1: VoicePipelineService Structure

**What:** Factory function returning a service object. This matches the existing codebase pattern (all services use factory functions, not classes).

**When to use:** Any server-side code needing STT, TTS, or voice formatting.

```typescript
// server/src/services/voice-pipeline.ts
import { execFile as execFileCb, spawn } from "node:child_process";
import { promisify } from "node:util";
import { tmpdir } from "node:os";
import { writeFile, unlink } from "node:fs/promises";
import path from "node:path";
import ffmpegPath from "ffmpeg-static";

const execFile = promisify(execFileCb); // same pattern as git-file-service.ts

// Rejects after `ms` and clears the timer once the wrapped promise settles.
// Note: losing the race does not kill the underlying child process.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

export function voicePipelineService() {
  // Assert at construction time: fail fast rather than at first request
  if (!ffmpegPath) throw new Error("ffmpeg-static binary not found on this platform");

  async function transcodeToWav16k(inputBuffer: Buffer, inputFormat: string): Promise<Buffer> {
    // Uses spawn with stdin/stdout pipes: no temp files, no shell expansion
    // See Pattern 2 for full implementation
    throw new Error("not implemented: see Pattern 2");
  }

  async function transcribe(
    buffer: Buffer,
    format: "webm" | "ogg" | "wav",
  ): Promise<{ text: string; language?: string }> {
    // 1. Transcode to WAV 16kHz mono (skip if already wav)
    // 2. Write WAV to temp file
    // 3. Run whisper-cpp with --language auto, or openai-whisper fallback
    // 4. Clean up temp file in finally block
    throw new Error("not implemented: skeleton only");
  }

  async function synthesize(text: string, voiceId?: string): Promise<Buffer> {
    // 1. Chunk text into sentences
    // 2. Run piper per chunk via execFile (not exec)
    // 3. Concatenate WAV buffers
    throw new Error("not implemented: skeleton only");
  }

  function formatForVoice(text: string): string {
    // Strip markdown: headings, bold, italic, code fences, bullet points
    // Fallback for when SPOKEN/DETAILED markers are absent
    throw new Error("not implemented: skeleton only");
  }

  return { transcribe, synthesize, formatForVoice };
}
```
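
Step 3 of `synthesize()` ("Concatenate WAV buffers") needs some care: joining complete WAV files with a plain `Buffer.concat` leaves every chunk's header embedded in the audio, and the first header still declares the first chunk's length. The following is a minimal sketch of one way to merge them. It assumes each chunk is canonical PCM WAV with the plain 44-byte header and identical sample rate and channel count, which should be confirmed against real Piper output. `concatWavBuffers` is a hypothetical helper name, not something in the codebase.

```typescript
// Hypothetical helper: joins canonical PCM WAV buffers into a single WAV.
// Assumption: every input uses the plain 44-byte RIFF header and the same
// sample rate / channel count. Headers with extra chunks need a real parser.
const WAV_HEADER_BYTES = 44;

export function concatWavBuffers(wavs: Buffer[]): Buffer {
  const first = wavs[0];
  if (!first) throw new Error("no WAV buffers to concatenate");

  // Reuse the first header (copied so the input is not mutated) and keep only
  // the PCM payload of each chunk.
  const header = Buffer.from(first.subarray(0, WAV_HEADER_BYTES));
  const pcm = Buffer.concat(wavs.map((w) => w.subarray(WAV_HEADER_BYTES)));

  header.writeUInt32LE(36 + pcm.length, 4); // RIFF chunk size = file size - 8
  header.writeUInt32LE(pcm.length, 40);     // data chunk size
  return Buffer.concat([header, pcm]);
}
```

If the header layout turns out to vary, an alternative is to ask the TTS tool for raw PCM and write one header at the end.
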

### Pattern 2: ffmpeg Transcode via Pipe (Buffer In, Buffer Out)

**What:** Transcode an audio buffer to WAV 16kHz mono without writing the input to disk. Uses `spawn` with stdio pipes, which is not `exec` and does not expand shell metacharacters.

**When to use:** Any audio buffer received from HTTP multipart or Telegram file download.

```typescript
// Source: ffmpeg-static GitHub + Node.js child_process docs
import { spawn } from "node:child_process";
import ffmpegPath from "ffmpeg-static";

function transcodeToWav16k(inputBuffer: Buffer, inputFormat: string): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    // ffmpegPath is a string (asserted at service construction).
    // inputFormat must be one of a fixed set ("webm" | "ogg" | "wav"), never the raw mimetype.
    const ff = spawn(
      ffmpegPath as string,
      [
        "-f", inputFormat,   // e.g. "webm" or "ogg", chosen by the route from file.mimetype
        "-i", "pipe:0",      // read from stdin
        "-ar", "16000",      // 16kHz sample rate, required by Whisper
        "-ac", "1",          // mono
        "-f", "wav",
        "pipe:1",            // write to stdout
      ],
      { stdio: ["pipe", "pipe", "pipe"] },
    );
    const chunks: Buffer[] = [];
    let stderr = "";
    ff.stdout.on("data", (chunk: Buffer) => chunks.push(chunk));
    // Drain stderr: ffmpeg logs there, and an unread pipe can stall the process
    ff.stderr.on("data", (chunk: Buffer) => { stderr += chunk.toString(); });
    ff.on("close", (code) => {
      if (code === 0) resolve(Buffer.concat(chunks));
      else reject(new Error(`ffmpeg exited with code ${code}: ${stderr.slice(-500)}`));
    });
    ff.on("error", reject);
    // EPIPE surfaces here if ffmpeg exits before consuming all input
    ff.stdin.on("error", reject);
    ff.stdin.end(inputBuffer);
  });
}
```

**Security note:** `spawn` (unlike `exec`) does not invoke a shell. Arguments are passed as an array, so no shell expansion occurs. Note that multer copies `file.mimetype` from the `Content-Type` the client sends with the upload part, so it is user-controlled; the route must map it to a fixed set of format names (`webm`, `ogg`, `wav`) before the value reaches the `-f` argument.
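
Because multer takes `file.mimetype` from the upload's `Content-Type` header, one way to keep the `-f` argument constrained is an explicit allowlist. The sketch below is illustrative only: `audioFormatFromMime` is a hypothetical helper, and the set of accepted MIME strings is an assumption that should be checked against what the browser recorder and Telegram actually send.

```typescript
type AudioFormat = "webm" | "ogg" | "wav";

// Hypothetical helper: maps a client-supplied MIME type onto a fixed allowlist
// so only known format names ever reach the ffmpeg argument array.
export function audioFormatFromMime(mimetype: string): AudioFormat | null {
  // Drop parameters such as "; codecs=opus" and normalise case.
  const base = (mimetype.split(";")[0] ?? "").trim().toLowerCase();
  switch (base) {
    case "audio/webm":
    case "video/webm":
      return "webm";
    case "audio/ogg":
    case "application/ogg":
      return "ogg";
    case "audio/wav":
    case "audio/x-wav":
    case "audio/wave":
      return "wav";
    default:
      return null; // unknown type: the caller can reject instead of guessing
  }
}
```

Returning `null` for unknown types lets the route answer with a 4xx status rather than defaulting to WebM. Whether to reject or default is a design choice this document leaves open.
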

### Pattern 3: nexus-settings Schema Extension

**What:** Extend the Zod schema in `nexus-settings.ts`. The existing pattern uses `z.object()` with `.default()` values. New fields use `.default()` or `.optional()` so existing `nexus-settings.json` files parse without error.

```typescript
// MODIFIED: server/src/services/nexus-settings.ts
export const VOICE_MODES = ["text", "voice_input", "full_voice"] as const;
export type VoiceMode = (typeof VOICE_MODES)[number];

const nexusSettingsSchema = z.object({
  mode: z.enum(NEXUS_MODES).default("both"),
  voiceEnabled: z.boolean().default(false),
  voiceMode: z.enum(VOICE_MODES).default("text"),   // NEW
  telegramToken: z.string().optional(),             // NEW
});
```

No file migration is needed: Zod `.default()` and `.optional()` handle missing fields in existing `nexus-settings.json` files gracefully.

### Pattern 4: voiceMode in createMessageSchema

**What:** Add an optional `voiceMode` field to the shared Zod validator. This follows the same pattern as `messageType`, which is already optional.

```typescript
// MODIFIED: packages/shared/src/validators/chat.ts
export const createMessageSchema = z.object({
  role: z.enum(["user", "assistant", "system"]),
  content: z.string().min(1).max(100_000),
  agentId: z.string().uuid().optional(),
  messageType: z.string().optional(),
  voiceMode: z.enum(["text", "voice_input", "full_voice"]).optional(),  // NEW
});
```

**On persistence:** The `chatMessages` DB table has a `message_type` text column with no constraints. Store the voice flag there as a mapped value (`"voice_full"` for `full_voice`, `"voice_input"` for `voice_input`) so it survives the request boundary and is queryable by the Phase 37 UI. No DB migration is needed because the column already exists as a free-text field.
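
The mapping between the request-level `voiceMode` and the stored `message_type` value can be kept in one place so the write path and the Phase 37 read path cannot drift apart. This is a sketch under the assumption that the stored strings are exactly the `"voice_full"` and `"voice_input"` examples given in this document; both function names are hypothetical.

```typescript
type VoiceMode = "text" | "voice_input" | "full_voice";

// Stored values follow the examples in this document; treat them as an assumption.
const MESSAGE_TYPE_BY_VOICE_MODE: Record<VoiceMode, string | undefined> = {
  text: undefined, // plain text messages keep message_type unset
  voice_input: "voice_input",
  full_voice: "voice_full",
};

// Write path: request flag -> value for the message_type column.
export function messageTypeForVoiceMode(mode?: VoiceMode): string | undefined {
  return mode ? MESSAGE_TYPE_BY_VOICE_MODE[mode] : undefined;
}

// Read path: stored message_type -> flag; anything unrecognised counts as text.
export function voiceModeFromMessageType(messageType?: string | null): VoiceMode {
  if (messageType === "voice_full") return "full_voice";
  if (messageType === "voice_input") return "voice_input";
  return "text";
}
```
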

### Pattern 5: Dual Output via Prompt Engineering

**What:** Inject a system prompt suffix when `voiceMode === "full_voice"` that instructs the AI to produce two labeled sections.

```typescript
// In the stream endpoint (chat.ts), after resolving memory and before the token loop:
const { content, agentId, voiceMode } = req.body as {
  content: string; agentId?: string; voiceMode?: string;
};

if (voiceMode === "full_voice") {
  messagesWithMemory.push({
    role: "system",
    content: [
      "Format your response with EXACTLY these two labeled sections:",
      "",
      "SPOKEN: [Natural speech prose. No markdown. No bullet points. No code blocks. 2-3 sentences for spoken delivery.]",
      "",
      "DETAILED: [Your full response with markdown, code blocks, and all detail.]",
    ].join("\n"),
  });
}
```

Post-processing fallback in `formatForVoice()`: if the AI response does not contain the `SPOKEN:` marker, strip markdown symbols from the full content and use it as the spoken text. This handles the ~10% format failure rate on smaller models.
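
The marker split and the strip fallback could look roughly like the sketch below. It is an illustration, not the project's implementation: `splitDualOutput` and `stripMarkdown` are hypothetical names, and the strip list covers only the constructs named in this document (headings, bold, italic, code fences, bullet points) plus inline code and links.

```typescript
// Hypothetical sketch: explicit strip list for text that will be spoken aloud.
export function stripMarkdown(text: string): string {
  return text
    .replace(/```[\s\S]*?```/g, " ")          // fenced code blocks: drop entirely
    .replace(/`([^`]*)`/g, "$1")              // inline code: keep the text
    .replace(/^#{1,6}\s+/gm, "")              // heading markers
    .replace(/^\s*(?:[-*+]|\d+\.)\s+/gm, "")  // bullet and numbered list markers
    .replace(/(\*\*|\*)(.+?)\1/g, "$2")       // bold and italic asterisks
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1")  // links: keep the label
    .replace(/\s+/g, " ")
    .trim();
}

export function splitDualOutput(response: string): { spoken: string; detailed: string } {
  const match = response.match(/SPOKEN:\s*([\s\S]*?)\s*DETAILED:\s*([\s\S]*)/);
  if (match) {
    // Markers present: strip the spoken part anyway in case markdown slipped in.
    return { spoken: stripMarkdown(match[1] ?? ""), detailed: (match[2] ?? "").trim() };
  }
  // Markers absent: speak a stripped version of the whole reply.
  return { spoken: stripMarkdown(response), detailed: response.trim() };
}
```

Each entry in the strip list maps to one unit test case, which keeps the fallback auditable as the list grows.
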

### Anti-Patterns to Avoid

- **Using `exec` instead of `execFile`:** `exec` passes the command through a shell, enabling injection. Always use `execFile` with an array of arguments. The codebase convention confirmed in `git-file-service.ts` and `chat-files.ts` is `promisify(execFileCb)`.
- **Temp file leaks:** If a temp WAV file is written before passing to whisper, cleanup must be in a `finally` block. The pipe approach in Pattern 2 avoids this entirely for the ffmpeg step.
- **fluent-ffmpeg:** Archived May 2025; do not install or reference.
- **Spawning piper on full multi-paragraph text:** Piper silently truncates responses over ~400 characters. Chunk into sentences before synthesis.
- **Missing `ffmpegPath` null check:** `ffmpeg-static` returns `null` on unsupported platforms. Assert at service construction time, not at call time.

---

## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| FFmpeg binary distribution | Custom binary download logic | ffmpeg-static ^5.3.0 | Ships FFmpeg 6.1.1 binaries for macOS arm64 + Linux; resolves binary path automatically |
| Markdown stripping for TTS | Regex soup | `formatForVoice()` with an explicit targeted strip list | Regex soup breaks on edge cases; keep the strip list explicit and unit-tested |
| Audio format detection | Magic byte inspection | `file.mimetype` from multer, mapped to a fixed format allowlist | multer populates mimetype from the upload's client-supplied `Content-Type` and does not validate it; the allowlist keeps the ffmpeg `-f` argument to known values |
| Sentence chunking | NLP library | Simple `.split(/(?<=[.!?])\s+/)` with length cap at 100 chars | Works for 95% of responses; no dependency needed for Phase 36 |

**Key insight:** The hardest part of this phase is not the audio processing. It is ensuring every layer of the message pipeline respects the `voiceMode` flag. Audit the full chain (request body → Zod parse → addMessage → stream endpoint prompt injection) before building the dual output feature on top of it.

---

## Common Pitfalls

### Pitfall 1: ffmpeg Not Found at Runtime

**What goes wrong:** `spawn(ffmpegPath, ...)` throws `ENOENT` in production or service environments.

**Why it happens:** `ffmpeg-static` returns `null` on platforms without a prebuilt binary. The returned path must be used directly; do not re-resolve via `PATH`.

**How to avoid:** At service construction, assert `if (!ffmpegPath) throw new Error("ffmpeg-static binary not found")`. Add a startup smoke test: spawn ffmpeg with `-version` and log the result. Fail fast rather than failing on first request.

**Warning signs:** `Error: spawn null ENOENT` or `TypeError: path must be a string`.

### Pitfall 2: Raw WebM Sent Directly to Whisper Without Transcoding

**What goes wrong:** The existing `chat-files.ts` transcription route writes raw WebM to disk and passes it to `whisper-cpp` without transcoding. This works accidentally when whisper-cpp was compiled with libavformat, but fails silently or returns garbage on other builds.

**Why it happens:** The original code relied on whisper-cpp's built-in demuxer rather than explicit format conversion.

**How to avoid:** Always transcode to WAV 16kHz mono via ffmpeg first, regardless of whether the input format might work natively. VPIPE-04 requires this explicitly. The pipe approach handles both WebM and OGG paths identically.

**Warning signs:** Transcription returns empty string or garbled text on WebM or OGG input.

### Pitfall 3: voiceMode Flag Stripped at addMessage()

**What goes wrong:** The stream endpoint reads `voiceMode` from `req.body` and uses it for the system prompt, but then calls `svc.addMessage()`, which only accepts `{ role, content, agentId, messageType }`. The flag is silently dropped and never stored.

**Why it happens:** `createMessageSchema` in the shared validators does not include `voiceMode`, so Zod strips it.

**How to avoid:** Add `voiceMode` to `createMessageSchema` as an optional field. When persisting, map it to a `messageType` value before calling `addMessage()` (`"voice_full"` for `full_voice`, `"voice_input"` for `voice_input`), as the stream endpoint example under Code Examples does. The DB `message_type` column is a free-text field with no constraints, so it can store any string value.

**Warning signs:** Phase 37 UI cannot determine which messages were voice messages when rendering chat history.

### Pitfall 4: Piper Process Reload Per Request

**What goes wrong:** Each `synthesize()` call spawns a new `piper` process. Piper loads the ONNX model fresh each time (200–800ms overhead). Responses over ~400 characters are silently truncated.

**Why it happens:** Standard one-shot `execFile` pattern for CLI tools.

**How to avoid for Phase 36:** Chunk text into sentences (max ~100 chars per chunk) before calling piper. Add a warmup call at server startup. A persistent piper process architecture is deferred to Phase 39; sentence chunking is the correct Phase 36 mitigation and keeps VPIPE-02 within 3 seconds for typical responses.

**Warning signs:** VPIPE-02 latency exceeds 3 seconds. Final sentence of long responses missing from audio output.
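
The sentence-chunking mitigation can be sketched as follows. This is a minimal illustration using the split regex quoted in this document plus a word-boundary fallback for sentences longer than the cap; `chunkForPiper` is a hypothetical name and the 100-character default mirrors the figure above.

```typescript
// Hypothetical helper: sentence split with a soft length cap per chunk.
// A single word longer than maxChars is kept whole rather than cut mid-word.
export function chunkForPiper(text: string, maxChars = 100): string[] {
  const sentences = text.trim().split(/(?<=[.!?])\s+/).filter(Boolean);
  const chunks: string[] = [];

  for (const sentence of sentences) {
    if (sentence.length <= maxChars) {
      chunks.push(sentence);
      continue;
    }
    // Over-long sentence: pack words greedily up to the cap.
    let current = "";
    for (const word of sentence.split(/\s+/)) {
      if (current && current.length + 1 + word.length > maxChars) {
        chunks.push(current);
        current = word;
      } else {
        current = current ? `${current} ${word}` : word;
      }
    }
    if (current) chunks.push(current);
  }
  return chunks;
}
```

Each returned chunk would then be passed to one piper invocation, with the resulting audio joined in order.
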

### Pitfall 5: Dual Output Markers Absent on Smaller Models

**What goes wrong:** The AI responds without `SPOKEN:` and `DETAILED:` section markers (~10% of calls on 7B-class models). `synthesize()` receives the full markdown response and speaks the markdown symbols aloud.

**Why it happens:** Smaller models have lower format-adherence reliability on structured output prompts.

**How to avoid:** Implement the post-processing fallback in `formatForVoice()`: if the `SPOKEN:` marker is absent in the response, strip markdown and use the full content. The dual output prompt is Approach A; the strip fallback is Approach B. Both must be implemented. B is a required safety net, not optional.

**Warning signs:** Text passed to TTS still contains literal `**` markers or triple-backtick code fences.

### Pitfall 6: multer Audio Upload Config Not Exported from chat-files.ts

**What goes wrong:** When creating `voice.ts`, a developer attempts to import the `audioUpload` multer instance from `chat-files.ts`. It is defined inline and not exported.

**Why it happens:** The multer config for audio is scoped inside `chatFileRoutes()`.

**How to avoid:** Define a fresh multer instance in `voice.ts`. Import `MAX_ATTACHMENT_BYTES` from `../attachment-types.js` to keep the file size limit consistent across all upload endpoints.

---

## Code Examples

### Voice Route: POST /api/transcribe and POST /api/synthesize

```typescript
// server/src/routes/voice.ts
import { Router } from "express";
import multer from "multer";
import { assertBoard } from "./authz.js";
import { voicePipelineService } from "../services/voice-pipeline.js";
import { MAX_ATTACHMENT_BYTES } from "../attachment-types.js";

const audioUpload = multer({
  storage: multer.memoryStorage(),
  limits: { fileSize: MAX_ATTACHMENT_BYTES, files: 1 },
});

export function voiceRoutes(): Router {
  const router = Router();
  const svc = voicePipelineService();

  router.post("/transcribe", async (req, res) => {
    assertBoard(req);
    await new Promise<void>((resolve, reject) =>
      audioUpload.single("audio")(req, res, (err) => (err ? reject(err) : resolve()))
    );
    const file = (req as any).file as { buffer: Buffer; mimetype: string } | undefined;
    if (!file) { res.status(400).json({ error: "Missing audio field" }); return; }

    // Map the client-supplied mimetype onto the fixed format set accepted by transcribe()
    const fmt = file.mimetype.includes("ogg") ? "ogg"
      : file.mimetype.includes("wav") ? "wav"
      : "webm";

    const result = await svc.transcribe(file.buffer, fmt);
    res.json(result);
  });

  router.post("/synthesize", async (req, res) => {
    assertBoard(req);
    const { text, voiceId } = req.body as { text?: string; voiceId?: string };
    if (!text || typeof text !== "string") {
      res.status(400).json({ error: "text is required" }); return;
    }
    const audioBuffer = await svc.synthesize(text, voiceId);
    res.setHeader("Content-Type", "audio/wav");
    res.send(audioBuffer);
  });

  return router;
}
```

### Mount in app.ts

```typescript
// server/src/app.ts: add alongside other api.use() calls
import { voiceRoutes } from "./routes/voice.js";
// ...
api.use(voiceRoutes());
```

### Remove /transcribe from chat-files.ts

Delete lines 297–386 (the inline `audioUpload` multer instance, `runAudioUpload` helper, and the `router.post("/transcribe", ...)` handler). The endpoint is now owned by `voice.ts`. No other code in `chat-files.ts` references these lines.

### voiceMode injection in chat.ts stream endpoint

```typescript
// In POST /conversations/:id/stream handler, after resolving settings/memory/puter token:
const { content, agentId, voiceMode } = req.body as {
  content: string; agentId?: string; voiceMode?: "text" | "voice_input" | "full_voice";
};

// ... (existing message building logic) ...

if (voiceMode === "full_voice") {
  messagesWithMemory.push({
    role: "system",
    content: [
      "Format your response with EXACTLY these two labeled sections:",
      "",
      "SPOKEN: [Natural speech prose only. No markdown. No bullet points. Max 2-3 sentences.]",
      "",
      "DETAILED: [Full response with all detail, code blocks, and markdown formatting.]",
    ].join("\n"),
  });
}

// ... (existing token stream loop) ...

// When persisting the assistant reply, encode voiceMode in messageType:
const message = await svc.addMessage(req.params.id!, {
  role: "assistant",
  content: fullContent.trim(),
  agentId: agentId || undefined,
  messageType: voiceMode === "full_voice" ? "voice_full"
    : voiceMode === "voice_input" ? "voice_input"
    : undefined,
});
```

---

## Environment Availability

| Dependency | Required By | Available | Version | Fallback |
|------------|-------------|-----------|---------|----------|
| ffmpeg-static (npm) | VPIPE-04 | Not installed | n/a | Must install: `pnpm add ffmpeg-static` |
| ffmpeg (CLI/system) | VPIPE-04 | Not in PATH | n/a | Provided by ffmpeg-static package |
| whisper-cpp | VPIPE-01 | Not confirmed in PATH | n/a | openai-whisper Python CLI (existing cascade) |
| piper (CLI) | VPIPE-02 | Not confirmed in PATH | n/a | 503 response with message; non-blocking for other requirements |
| multer | VPIPE-01, VPIPE-03 | Yes (server/package.json ^2.0.2) | 2.0.2 | n/a |
| Node.js child_process | All | Yes (built-in) | n/a | n/a |

**Missing dependencies with no fallback:**
- `ffmpeg-static`: must be added to `server/package.json` as the first task. One `pnpm add ffmpeg-static` command resolves this. VPIPE-04 is blocked until it is installed.

**Missing dependencies with fallback:**
- `piper` CLI: not confirmed installed. When the piper binary is absent, `synthesize()` should fail with a descriptive error that the `/synthesize` route maps to HTTP 503, the same defensive pattern as the existing whisper cascade. VPIPE-02 verification requires piper to be installed on the target machine.
- `whisper-cpp`: the openai-whisper Python CLI fallback already exists in the codebase and covers VPIPE-01.

---

## Validation Architecture

### Test Framework

| Property | Value |
|----------|-------|
| Framework | vitest ^3.0.5 |
| Config file | `/opt/nexus/vitest.config.ts` (environment: node) |
| Quick run command | `cd /opt/nexus && pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-pipeline.test.ts` |
| Full suite command | `cd /opt/nexus && pnpm --filter @paperclipai/server test --run` |

### Phase Requirements to Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| VPIPE-01 | `transcribe()` calls whisper and returns `{ text, language }` | unit | `pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-pipeline.test.ts` | Wave 0 |
| VPIPE-02 | `synthesize()` returns a Buffer; respects 8s timeout | unit | `pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-pipeline.test.ts` | Wave 0 |
| VPIPE-03 | `voicePipelineService` consumed by voice routes without HTTP round-trip | unit | `pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-routes.test.ts` | Wave 0 |
| VPIPE-04 | `transcodeToWav16k()` spawns ffmpeg with `-ar 16000 -ac 1` | unit (mock spawn) | `pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-pipeline.test.ts` | Wave 0 |
| VPIPE-05 | `createMessageSchema` accepts and preserves `voiceMode` field | unit | `pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-schema.test.ts` | Wave 0 |
| VPIPE-06 | `formatForVoice()` strips markdown; fallback activates when SPOKEN marker absent | unit | `pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-pipeline.test.ts` | Wave 0 |

### Sampling Rate

- **Per task commit:** `pnpm --filter @paperclipai/server test --run src/__tests__/36-*.test.ts`
- **Per wave merge:** `pnpm --filter @paperclipai/server test --run`
- **Phase gate:** Full suite green before `/gsd:verify-work`

### Wave 0 Gaps

- [ ] `server/src/__tests__/36-voice-pipeline.test.ts`: unit tests for VPIPE-01, VPIPE-02, VPIPE-04, VPIPE-06 (mock execFile and spawn)
- [ ] `server/src/__tests__/36-voice-routes.test.ts`: supertest tests for POST /api/transcribe and POST /api/synthesize (mock voicePipelineService)
- [ ] `server/src/__tests__/36-voice-schema.test.ts`: Zod validator tests for voiceMode field on createMessageSchema and nexus-settings schema extension

---

## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| fluent-ffmpeg | ffmpeg-static + spawn | May 2025 (fluent-ffmpeg archived) | Do not use fluent-ffmpeg in any new code |
| Raw WebM passed to whisper-cpp | WebM/OGG transcoded to WAV 16kHz via ffmpeg first | Phase 36 (this phase) | More reliable transcription across whisper build variants |
| Transcription logic in chat-files.ts route | Transcription in VoicePipelineService | Phase 36 (this phase) | Enables Telegram bridge to reuse STT without HTTP round-trip |

**Deprecated/outdated:**
- `fluent-ffmpeg`: archived May 22 2025; do not install
- `@ffmpeg-installer/ffmpeg`: ships FFmpeg 4.x; outdated codec support

---
|
||||
|
||||
## Open Questions
|
||||
|
||||
1. **Does smart-whisper accept OGG or WebM directly without transcoding?**
|
||||
- What we know: The existing route passes raw WebM to whisper-cpp CLI and it sometimes works because whisper-cpp was compiled with libavformat support on some builds.
|
||||
- What's unclear: Whether this is reliable across all whisper-cpp build variants present on the Mac Mini.
|
||||
- Recommendation: Make this question moot — always transcode via ffmpeg (VPIPE-04 requires this explicitly). The pipe transcode adds ~50ms and eliminates the variability entirely.
|
||||
|
||||
2. **Where precisely should voiceMode be stored after message persistence?**
|
||||
- What we know: The `chat_messages` table has a `message_type` text column (free-text, no constraints). voiceMode is not a DB column.
|
||||
- What's unclear: Whether Phase 37 UI needs to query "was this a voice message?" from the DB, or whether the flag only matters during the live stream session.
|
||||
- Recommendation: Store voiceMode as the `messageType` value (e.g., `"voice_full"` or `"voice_input"`) so it survives to the DB. This satisfies the "voiceMode flag survives message persistence" success criterion without a DB migration. The `messageType` column is already free-text.
|
||||
|
||||
3. **Should `piperBinaryPath` and `whisperBinaryPath` be stored in nexus-settings?**
|
||||
- What we know: STATE.md notes that absolute binary paths should be stored in settings for service-mode reliability. Phase 39 onboarding will detect and configure these paths.
|
||||
- What's unclear: Whether Phase 36 should pre-provision these fields or defer entirely to Phase 39.
|
||||
- Recommendation: Add `piperBinaryPath: z.string().optional()` and `whisperBinaryPath: z.string().optional()` to nexus-settings schema in Phase 36. Default behavior: fall back to PATH lookup when the fields are absent. This preps the schema for Phase 39 onboarding without blocking Phase 36 delivery.
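
The recommendations for questions 2 and 3 can be illustrated with two small pure helpers. This is a sketch under assumptions rather than the chosen design: the function names are hypothetical, the `"voice_full"` and `"voice_input"` strings are the example values from question 2, leaving `messageType` untouched for plain text messages is an assumption, and the bare command names `piper` and `whisper-cpp` used for the PATH fallback are placeholders for whatever the host actually installs.

```typescript
type VoiceMode = "text" | "voice_input" | "full_voice";

// Hypothetical mapping from the voiceMode flag to a messageType value.
// Returns undefined for "text" so the caller keeps whatever messageType
// it would have written anyway (assumption: text messages need no marker).
function messageTypeForVoiceMode(mode: VoiceMode): string | undefined {
  switch (mode) {
    case "full_voice":
      return "voice_full";
    case "voice_input":
      return "voice_input";
    case "text":
      return undefined;
  }
}

// Inverse lookup when reading a persisted message back.
// Any unrecognized or missing messageType is treated as plain text.
function voiceModeFromMessageType(messageType: string | null | undefined): VoiceMode {
  if (messageType === "voice_full") return "full_voice";
  if (messageType === "voice_input") return "voice_input";
  return "text";
}

// Shape of the two optional settings fields proposed in question 3.
interface BinaryPathSettings {
  piperBinaryPath?: string;
  whisperBinaryPath?: string;
}

// Prefer the absolute path from settings; otherwise return the bare
// command name so the OS resolves it via PATH at spawn time.
function resolveBinary(configuredPath: string | undefined, fallbackName: string): string {
  const trimmed = configuredPath?.trim();
  return trimmed ? trimmed : fallbackName;
}

// Example: settings written before Phase 39 onboarding has run.
const settings: BinaryPathSettings = { piperBinaryPath: "/opt/piper/piper" };
const piperCommand = resolveBinary(settings.piperBinaryPath, "piper");
const whisperCommand = resolveBinary(settings.whisperBinaryPath, "whisper-cpp");
```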

---

## Project Constraints (from CLAUDE.md)

No `CLAUDE.md` found in `/opt/nexus`. No project-level constraints beyond those captured in CONTEXT.md and STATE.md.

**Inferred from codebase conventions:**
- All services use factory functions, not classes (confirmed across all `server/src/services/*.ts` files)
- Use `promisify(execFile)` from `node:child_process`, never `exec` (confirmed: `git-file-service.ts`, `chat-files.ts`)
- `spawn` with a stdio array for pipe-based subprocesses (not confirmed in the codebase, which has no existing examples; this is the standard Node.js approach)
- Zod for all schema validation — no other validator libraries in the codebase
- Tests in `server/src/__tests__/`, named `NN-feature.test.ts` for phase-scoped tests
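
The first two conventions, combined with the `Promise.race` timeout wrapper listed in the locked decisions, might look like the following. This is a hedged sketch: `createCliRunner` and `withTimeout` are hypothetical names, and 8000 ms is the timeout value from CONTEXT.md. Note that racing against a timer only stops waiting; it does not kill the child process. `execFile` also accepts its own `timeout` option, which does terminate the child, and that may be worth combining with this pattern.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Rejects after `ms` milliseconds unless `work` settles first.
// The timer is always cleared so it cannot keep the process alive.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Hypothetical factory in the "factory function, not class" style.
// execFile passes args as an array without a shell, so user-supplied
// values cannot be interpreted as shell syntax.
function createCliRunner(options: { timeoutMs?: number } = {}) {
  const timeoutMs = options.timeoutMs ?? 8000;
  return {
    async run(binary: string, args: string[]): Promise<string> {
      const { stdout } = await withTimeout(
        execFileAsync(binary, args),
        timeoutMs,
        binary,
      );
      return stdout;
    },
  };
}
```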

---

## Sources

### Primary (HIGH confidence)
- Direct codebase inspection: `server/src/routes/chat-files.ts` lines 297–386 — existing transcription pattern
- Direct codebase inspection: `server/src/services/nexus-settings.ts` — Zod schema structure and extension pattern
- Direct codebase inspection: `packages/shared/src/validators/chat.ts` — createMessageSchema fields and patterns
- Direct codebase inspection: `packages/shared/src/types/chat.ts` — ChatMessage interface
- Direct codebase inspection: `server/src/routes/chat.ts` lines 91–193 — stream endpoint where voiceMode must be injected
- Direct codebase inspection: `server/src/app.ts` lines 164–165 — route mount pattern
- Direct codebase inspection: `packages/db/src/schema/chat_messages.ts` — confirms no voiceMode column; message_type is free-text
- Direct codebase inspection: `server/src/services/git-file-service.ts` — established `promisify(execFile)` pattern
- `npm view ffmpeg-static version` returns `5.3.0` (verified 2026-04-03)
- `npm view smart-whisper version` returns `0.8.1` (verified 2026-04-03)
- `.planning/research/SUMMARY.md` — project-level v1.6 research, pitfalls 27–40

### Secondary (MEDIUM confidence)
- [ffmpeg-static GitHub](https://github.com/eugeneware/ffmpeg-static) — macOS arm64 binary confirmed, pipe invocation pattern documented
- `.planning/STATE.md` — architectural decisions confirmed by project owner
- `.planning/phases/36-voice-pipeline-foundation/36-CONTEXT.md` — locked implementation decisions

### Tertiary (LOW confidence — patterns inferred)
- Dual output prompt reliability on 7B models: inferred from structured output community reports; not benchmarked on the specific Hermes model in use
- Sentence chunking split regex: industry pattern from TTS pipeline implementations; not sourced from a canonical reference

---

## Metadata

**Confidence breakdown:**
- Standard stack: HIGH — ffmpeg-static version verified via npm registry; all other libraries already in the codebase
- Architecture: HIGH — based on direct codebase inspection; all integration points confirmed by reading actual source files
- Pitfalls: HIGH — derived from both codebase analysis and project-level research in SUMMARY.md
- Test map: MEDIUM — test file names follow the established `NN-feature.test.ts` naming pattern in `__tests__/`; test content is inferred from requirements, not existing test files

**Research date:** 2026-04-03
**Valid until:** 2026-05-03 (stable domain — ffmpeg-static and whisper APIs change slowly)