nexus/.planning/phases/36-voice-pipeline-foundation/36-RESEARCH.md
2026-04-04 03:55:49 +00:00


Phase 36: Voice Pipeline Foundation - Research

Researched: 2026-04-03
Domain: Server-side STT/TTS voice pipeline: ffmpeg transcoding, VoicePipelineService abstraction, dual output formatting, voiceMode flag propagation, nexus-settings schema extension
Confidence: HIGH


<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

All implementation choices are at Claude's discretion — discuss phase was skipped per user setting. Use ROADMAP phase goal, success criteria, and codebase conventions to guide decisions.

Key research findings to incorporate:

  • VoicePipelineService as server-side service: transcribe(buffer, format), synthesize(text, voiceId?), formatForVoice(text)
  • Move /transcribe from chat-files.ts to new voice.ts route to reduce rebase conflict surface
  • Use ffmpeg-static ^5.2.0 (NOT archived fluent-ffmpeg) for WebM to WAV and OGG to WAV transcoding
  • Use execFile (not exec) for CLI subprocess calls — prevents shell injection
  • Wrap CLI calls (piper, ffmpeg) in Promise.race([call, timeout(8000)]) for graceful degradation
  • Voice mode flag must survive: client to Express to message persistence to agent session codec
  • Dual output: prompt engineering requests SPOKEN: [prose] plus DETAILED: [markdown] with post-processing strip as fallback
  • nexus-settings schema extension: voiceMode: "text" | "voice_input" | "full_voice", optional telegramToken
  • No DB migrations — all state in existing JSONB fields and file-backed JSON

Claude's Discretion

All implementation choices are at Claude's discretion — discuss phase was skipped.

Deferred Ideas (OUT OF SCOPE)

None; discuss phase skipped.
</user_constraints>


<phase_requirements>

Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| VPIPE-01 | User's voice input is transcribed via local Whisper STT with automatic language detection | Existing whisper-cpp/openai-whisper cascade in chat-files.ts; move to VoicePipelineService.transcribe(); add -l auto flag for language detection |
| VPIPE-02 | Agent text responses are synthesized to speech via local Piper TTS in under 3 seconds | Piper via execFile with sentence chunking; warmup call on startup; 8s timeout via Promise.race |
| VPIPE-03 | Voice pipeline accepts audio from any transport via a shared VoicePipelineService | Service abstraction in server/src/services/voice-pipeline.ts; consumed by voice.ts routes and future Telegram service |
| VPIPE-04 | Audio from any source transcoded to WAV 16kHz mono via ffmpeg before Whisper processing | ffmpeg-static ^5.3.0; spawn with -ar 16000 -ac 1; pipe buffer through ffmpeg stdin/stdout |
| VPIPE-05 | Voice mode flag on messages triggers voice-optimized response formatting | Add voiceMode field to createMessageSchema Zod validator; pass through chat.ts stream endpoint; inject as system prompt instruction |
| VPIPE-06 | Every voice interaction produces dual output: spoken prose + full text with code blocks | Dual output via SPOKEN/DETAILED prompt template injected when voiceMode equals full_voice; post-process strip as fallback |
</phase_requirements>

Summary

Phase 36 builds the transport-agnostic voice pipeline that all subsequent phases (37: Web Chat Voice UI, 38: Telegram Bridge) depend on. The work is purely server-side with zero UI changes. Three deliverables gate everything downstream: (1) VoicePipelineService in server/src/services/voice-pipeline.ts with transcribe(), synthesize(), and formatForVoice() methods; (2) server/src/routes/voice.ts with POST /api/transcribe (moved from chat-files.ts) and POST /api/synthesize; (3) voiceMode flag wired from createMessageSchema through the stream endpoint to the AI prompt.

The codebase already has a working Whisper cascade in chat-files.ts (lines 316-386) that handles the transcription pattern. The main work is to extract that logic into VoicePipelineService, add ffmpeg transcoding before Whisper (currently missing: the route writes raw WebM to disk and passes it directly to whisper-cpp without explicit format conversion), add a synthesize method that calls Piper, extend the nexus-settings schema with voiceMode and telegramToken, and propagate voiceMode through the message pipeline.

Primary recommendation: Build VoicePipelineService first, then voice.ts routes, then schema changes in the shared package, then wire voiceMode through chat.ts stream endpoint. This ordering makes each deliverable independently testable before the next is built.


Standard Stack

Core

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| ffmpeg-static | ^5.3.0 | Ships FFmpeg 6.1.1 binaries for macOS arm64/Linux; no system ffmpeg required | fluent-ffmpeg archived May 2025; ffmpeg-static is the maintained replacement |
| smart-whisper | ^0.8.1 | Node bindings for whisper.cpp with Apple Silicon acceleration | Already in codebase (used in chat-files cascade fallback); avoid re-implementing |
| zod | ^3.24.2 | Schema validation for settings extension and message validators | Already used throughout; do not introduce a second validator library |
| vitest | ^3.0.5 | Unit tests for VoicePipelineService and voice routes | Already the test framework (vitest.config.ts in server root) |

Supporting

| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| multer | ^2.0.2 | Multipart audio upload handling | Already used in chat-files.ts; reuse the same pattern in voice.ts |
| node:child_process | built-in | execFile wrapper for piper and ffmpeg CLI calls | Use promisify(execFile); same pattern as git-file-service.ts and existing transcribe route |

Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| ffmpeg-static | fluent-ffmpeg | fluent-ffmpeg archived May 2025; do not use |
| ffmpeg-static | @ffmpeg-installer/ffmpeg | Ships FFmpeg 4.x, not 6.x; older codec support |
| execFile subprocess | ffmpeg npm bindings | No mature maintained binding; CLI approach is industry standard |

Installation:

cd /opt/nexus/server && pnpm add ffmpeg-static
cd /opt/nexus/server && pnpm add -D @types/ffmpeg-static

Version verification: npm view ffmpeg-static version returns 5.3.0 (verified 2026-04-03). Use ^5.3.0 as the lower bound (newer than the ^5.2.0 in the CONTEXT.md decisions, but backwards compatible).


Architecture Patterns

server/src/
├── services/
│   └── voice-pipeline.ts      # NEW: VoicePipelineService (transcribe, synthesize, formatForVoice)
├── routes/
│   ├── voice.ts               # NEW: POST /api/transcribe, POST /api/synthesize
│   ├── chat-files.ts          # MODIFIED: remove POST /transcribe block (lines 297-386)
│   └── chat.ts                # MODIFIED: pass voiceMode through stream endpoint

packages/shared/src/
├── validators/
│   └── chat.ts                # MODIFIED: add voiceMode to createMessageSchema
└── types/
    └── chat.ts                # MODIFIED: add voiceMode to ChatMessage interface

server/src/services/
└── nexus-settings.ts          # MODIFIED: add voiceMode + telegramToken fields

Pattern 1: VoicePipelineService Structure

What: Factory function returning a service object — matches the existing codebase pattern (all services use factory functions, not classes).

When to use: Any server-side code needing STT, TTS, or voice formatting.

// server/src/services/voice-pipeline.ts
import { execFile as execFileCb, spawn } from "node:child_process";
import { promisify } from "node:util";
import { tmpdir } from "node:os";
import { writeFile, unlink } from "node:fs/promises";
import path from "node:path";
import ffmpegPath from "ffmpeg-static";

const execFile = promisify(execFileCb);  // same pattern as git-file-service.ts

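function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  // Clear the timer once either side settles so it does not linger for the full timeout.
  // Note: this only stops waiting; it does not terminate the underlying child process.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}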

export function voicePipelineService() {
  // Assert at construction time — fail fast rather than at first request
  if (!ffmpegPath) throw new Error("ffmpeg-static binary not found on this platform");

  async function transcodeToWav16k(inputBuffer: Buffer, inputFormat: string): Promise<Buffer> {
    // Uses spawn with stdin/stdout pipes — no temp files, no shell expansion
    // See Pattern 2 for full implementation
  }

  async function transcribe(
    buffer: Buffer,
    format: "webm" | "ogg" | "wav",
  ): Promise<{ text: string; language?: string }> {
    // 1. Transcode to WAV 16kHz mono (skip if already wav)
    // 2. Write WAV to temp file
    // 3. Run whisper-cpp with --language auto, or openai-whisper fallback
    // 4. Clean up temp file in finally block
  }

  async function synthesize(text: string, voiceId?: string): Promise<Buffer> {
    // 1. Chunk text into sentences
    // 2. Run piper per chunk via execFile (not exec)
    // 3. Concatenate WAV buffers
  }

  function formatForVoice(text: string): string {
    // Strip markdown: headings, bold, italic, code fences, bullet points
    // Fallback for when SPOKEN/DETAILED markers are absent
  }

  return { transcribe, synthesize, formatForVoice };
}
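
Step 3 of synthesize() ("Concatenate WAV buffers") needs some care: if each piper call returns a complete WAV file, every chunk carries its own header, and joining the raw buffers would leave headers in the middle of the audio. The following is a minimal sketch of one way to merge them, assuming all chunks are uncompressed PCM with identical sample rate, channel count and bit depth. The names concatWavChunks, parseWav, readTag and writeTag are hypothetical and do not exist in the codebase.

```typescript
// Hypothetical helper (not from the codebase). Assumes every input is an
// uncompressed PCM WAV with the same sample rate, channel count and bit depth.
interface ParsedWav {
  fmt: Uint8Array;  // body of the "fmt " chunk
  data: Uint8Array; // raw samples from the "data" chunk
}

function readTag(bytes: Uint8Array, offset: number): string {
  return String.fromCharCode(
    bytes[offset], bytes[offset + 1], bytes[offset + 2], bytes[offset + 3],
  );
}

function writeTag(out: Uint8Array, offset: number, text: string): void {
  for (let i = 0; i < 4; i++) out[offset + i] = text.charCodeAt(i);
}

function parseWav(bytes: Uint8Array): ParsedWav {
  if (bytes.length < 12 || readTag(bytes, 0) !== "RIFF" || readTag(bytes, 8) !== "WAVE") {
    throw new Error("Not a RIFF/WAVE buffer");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let fmt: Uint8Array | undefined;
  let data: Uint8Array | undefined;
  let offset = 12; // first sub-chunk starts after "RIFF", size, "WAVE"
  while (offset + 8 <= bytes.length) {
    const id = readTag(bytes, offset);
    const declared = view.getUint32(offset + 4, true); // little-endian chunk size
    const start = offset + 8;
    // A streamed WAV may declare more bytes than are present; clamp to the buffer.
    const end = Math.min(start + declared, bytes.length);
    if (id === "fmt ") fmt = bytes.subarray(start, end);
    if (id === "data") { data = bytes.subarray(start, end); break; }
    offset = end + (declared % 2); // chunks are padded to an even length
  }
  if (!fmt || !data) throw new Error("WAV is missing a fmt or data chunk");
  return { fmt, data };
}

export function concatWavChunks(wavs: Uint8Array[]): Uint8Array {
  if (wavs.length === 0) throw new Error("No WAV chunks to concatenate");
  const parsed = wavs.map(parseWav);
  const fmt = parsed[0].fmt; // format of the first chunk is reused for the result
  const dataLength = parsed.reduce((sum, p) => sum + p.data.length, 0);
  const out = new Uint8Array(12 + 8 + fmt.length + 8 + dataLength);
  const view = new DataView(out.buffer);
  writeTag(out, 0, "RIFF");
  view.setUint32(4, out.length - 8, true);
  writeTag(out, 8, "WAVE");
  writeTag(out, 12, "fmt ");
  view.setUint32(16, fmt.length, true);
  out.set(fmt, 20);
  const dataHeader = 20 + fmt.length;
  writeTag(out, dataHeader, "data");
  view.setUint32(dataHeader + 4, dataLength, true);
  let cursor = dataHeader + 8;
  for (const p of parsed) {
    out.set(p.data, cursor); // append samples only, never the per-chunk headers
    cursor += p.data.length;
  }
  return out;
}
```

A Node Buffer is a Uint8Array, so the per-chunk buffers returned by the piper calls could be passed in directly and the result wrapped with Buffer.from() before sending.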

Pattern 2: ffmpeg Transcode via Pipe (Buffer In, Buffer Out)

What: Transcode audio buffer to WAV 16kHz mono without writing input to disk. Uses spawn with stdio pipes — this is not exec and does not expand shell metacharacters.

When to use: Any audio buffer received from HTTP multipart or Telegram file download.

// Source: ffmpeg-static GitHub + Node.js child_process docs
import { spawn } from "node:child_process";
import ffmpegPath from "ffmpeg-static";

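function transcodeToWav16k(inputBuffer: Buffer, inputFormat: string): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    // ffmpegPath is a string (asserted at service construction). inputFormat must be an
    // allow-listed value chosen by the route; file.mimetype itself is client-supplied.
    const ff = spawn(
      ffmpegPath as string,
      [
        "-loglevel", "error", // keep stderr small: only real errors are logged
        "-f", inputFormat,  // "webm" or "ogg", as mapped from the upload's mimetype
        "-i", "pipe:0",     // read from stdin
        "-ar", "16000",     // 16kHz sample rate, required by Whisper
        "-ac", "1",         // mono
        "-f", "wav",
        "pipe:1",           // write to stdout
      ],
      { stdio: ["pipe", "pipe", "pipe"] },
    );
    const chunks: Buffer[] = [];
    const errChunks: Buffer[] = [];
    ff.stdout.on("data", (chunk: Buffer) => chunks.push(chunk));
    // Drain stderr so a full pipe buffer cannot stall ffmpeg; keep it for the error message
    ff.stderr.on("data", (chunk: Buffer) => errChunks.push(chunk));
    ff.on("close", (code) => {
      if (code === 0) resolve(Buffer.concat(chunks));
      else reject(new Error(`ffmpeg exited with code ${code}: ${Buffer.concat(errChunks).toString("utf8").trim()}`));
    });
    ff.on("error", reject);
    // If ffmpeg exits early the stdin write fails with EPIPE; the close handler reports the real error
    ff.stdin.on("error", () => {});
    ff.stdin.write(inputBuffer);
    ff.stdin.end();
  });
}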

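Security note: spawn (unlike exec) does not invoke a shell. Arguments are passed as an array, so no shell expansion occurs. Note that multer fills file.mimetype from the Content-Type of the uploaded part, which the client controls. inputFormat is safe only because the route maps the mimetype to a fixed set of values ("webm", "ogg", "wav") before anything reaches ffmpeg; never pass the raw mimetype through.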
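
As an illustration of that mapping step, the sketch below turns a client-supplied mimetype into one of the three formats the pipeline accepts and rejects everything else. resolveAudioFormat and the exact list of accepted mimetypes are assumptions made for this example, not existing code.

```typescript
// Hypothetical helper (not from the codebase): maps a client-supplied mimetype
// onto a closed set of formats so no raw header value reaches the ffmpeg arguments.
export type AudioFormat = "webm" | "ogg" | "wav";

export function resolveAudioFormat(mimetype: string): AudioFormat | null {
  // Drop parameters such as ";codecs=opus" and normalise case before matching.
  const base = mimetype.toLowerCase().replace(/;[\s\S]*$/, "").trim();
  switch (base) {
    case "audio/webm":
    case "video/webm":
      return "webm";
    case "audio/ogg":
    case "application/ogg":
      return "ogg";
    case "audio/wav":
    case "audio/wave":
    case "audio/x-wav":
      return "wav";
    default:
      return null; // unknown type: the caller should answer with a 4xx
  }
}
```

Unlike a substring check with a "webm" default, an unrecognised type yields null here, so the route can reject the upload instead of handing ffmpeg a format that does not match the data.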

Pattern 3: nexus-settings Schema Extension

What: Extend Zod schema in nexus-settings.ts — existing pattern uses z.object() with .default() values. New fields use .default() or .optional() so existing nexus-settings.json files parse without error.

// MODIFIED: server/src/services/nexus-settings.ts
export const VOICE_MODES = ["text", "voice_input", "full_voice"] as const;
export type VoiceMode = (typeof VOICE_MODES)[number];

const nexusSettingsSchema = z.object({
  mode: z.enum(NEXUS_MODES).default("both"),
  voiceEnabled: z.boolean().default(false),
  voiceMode: z.enum(VOICE_MODES).default("text"),     // NEW
  telegramToken: z.string().optional(),                // NEW
});

No file migration needed — Zod .default() and .optional() handle missing fields in existing nexus-settings.json gracefully.

Pattern 4: voiceMode in createMessageSchema

What: Add optional voiceMode field to the shared Zod validator. Matches the same pattern as messageType which is already optional.

// MODIFIED: packages/shared/src/validators/chat.ts
export const createMessageSchema = z.object({
  role: z.enum(["user", "assistant", "system"]),
  content: z.string().min(1).max(100_000),
  agentId: z.string().uuid().optional(),
  messageType: z.string().optional(),
  voiceMode: z.enum(["text", "voice_input", "full_voice"]).optional(),  // NEW
});

On persistence: The chatMessages DB table has a message_type text column with no constraints. Store voiceMode there (e.g., value "voice_full" or "voice_input") so it survives the request boundary and is queryable by Phase 37 UI. No DB migration needed — the column already exists as a free-text field.

Pattern 5: Dual Output via Prompt Engineering

What: Inject a system prompt suffix when voiceMode === "full_voice" that instructs the AI to produce two labeled sections.

// In the stream endpoint (chat.ts), after resolving memory and before the token loop:
const { content, agentId, voiceMode } = req.body as {
  content: string; agentId?: string; voiceMode?: string;
};

if (voiceMode === "full_voice") {
  messagesWithMemory.push({
    role: "system",
    content: [
      "Format your response with EXACTLY these two labeled sections:",
      "",
      "SPOKEN: [Natural speech prose. No markdown. No bullet points. No code blocks. 2-3 sentences for spoken delivery.]",
      "",
      "DETAILED: [Your full response with markdown, code blocks, and all detail.]",
    ].join("\n"),
  });
}

Post-processing fallback in formatForVoice(): if the AI response does not contain the SPOKEN: marker, strip markdown symbols from the full content and use that as the spoken text. This covers the estimated ~10% format failure rate on smaller models (an inferred figure, not benchmarked; see Sources).
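
The marker check and the strip fallback can be sketched together as follows. This is an illustrative example under assumptions, not the planned implementation: spokenTextFor and stripMarkdown are hypothetical names, and the rule list is deliberately short. Underscore emphasis is left untouched so identifiers such as voice_input are not mangled.

```typescript
// Hypothetical sketch: an explicit, ordered list of strip rules, each one unit-testable.
const STRIP_RULES: Array<[RegExp, string]> = [
  [/```[\s\S]*?```/g, " "],               // fenced code blocks are dropped entirely
  [/`([^`]*)`/g, "$1"],                   // inline code keeps its text
  [/!\[([^\]]*)\]\([^)]*\)/g, "$1"],      // images become their alt text
  [/\[([^\]]*)\]\([^)]*\)/g, "$1"],       // links become their link text
  [/^\s{0,3}#{1,6}\s+/gm, ""],            // heading markers
  [/^\s*(?:[-*+]|\d+\.)\s+/gm, ""],       // bullet and numbered list markers
  [/\*{1,3}([^*\n]+)\*{1,3}/g, "$1"],     // asterisk bold and italic
  [/\s+/g, " "],                          // collapse whitespace last
];

export function stripMarkdown(text: string): string {
  let result = text;
  for (const [pattern, replacement] of STRIP_RULES) {
    result = result.replace(pattern, replacement);
  }
  return result.trim();
}

export function spokenTextFor(response: string): string {
  // Take everything after "SPOKEN:" up to "DETAILED:" (or the end of the response).
  const match = /SPOKEN:\s*([\s\S]*?)(?=\s*DETAILED:|$)/.exec(response);
  const spoken = match ? match[1].trim() : "";
  // Marker missing or empty: fall back to the whole response.
  return stripMarkdown(spoken.length > 0 ? spoken : response);
}
```

Running the strip rules on the SPOKEN section as well is a defensive choice: a model that follows the two-section format may still slip emphasis markers into the prose.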

Anti-Patterns to Avoid

  • Using exec instead of execFile: exec passes the command through a shell, enabling injection. Always use execFile with an array of arguments. The codebase convention confirmed in git-file-service.ts and chat-files.ts is promisify(execFileCb).
  • Temp file leaks: If a temp WAV file is written before passing to whisper, cleanup must be in a finally block. The pipe approach in Pattern 2 avoids this entirely for the ffmpeg step.
  • fluent-ffmpeg: Archived May 2025 — do not install or reference.
  • Spawning piper on full multi-paragraph text: Piper silently truncates responses over ~400 characters. Chunk into sentences before synthesis.
  • Missing ffmpegPath null check: ffmpeg-static returns null on unsupported platforms. Assert at service construction time, not at call time.

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| FFmpeg binary distribution | Custom binary download logic | ffmpeg-static ^5.3.0 | Ships FFmpeg 6.1.1 binaries for macOS arm64 + Linux; resolves binary path automatically |
| Markdown stripping for TTS | Regex soup | formatForVoice() with an explicit targeted strip list | Regex soup breaks on edge cases; keep the strip list explicit and unit-tested |
| Audio format detection | Magic byte inspection | file.mimetype from multer, mapped to a fixed set of formats | multer populates mimetype from the upload's Content-Type header (client-supplied, not validated); map it to a known format instead of passing it through |
| Sentence chunking | NLP library | Simple .split(/(?<=[.!?])\s+/) with length cap at 100 chars | Works for 95% of responses; no dependency needed for Phase 36 |

Key insight: The hardest part of this phase is not the audio processing — it is ensuring every layer of the message pipeline respects the voiceMode flag. Audit the full chain (request body → Zod parse → addMessage → stream endpoint prompt injection) before building the dual output feature on top of it.


Common Pitfalls

Pitfall 1: ffmpeg Not Found at Runtime

What goes wrong: spawn(ffmpegPath, ...) throws ENOENT in production or service environments.

Why it happens: ffmpeg-static returns null on platforms without a prebuilt binary. The returned path must be used directly — do not re-resolve via PATH.

How to avoid: At service construction, assert if (!ffmpegPath) throw new Error("ffmpeg-static binary not found"). Add a startup smoke test: spawn ffmpeg with -version and log the result. Fail fast rather than failing on first request.

Warning signs: Error: spawn null ENOENT or TypeError: path must be a string.

Pitfall 2: Raw WebM Sent Directly to Whisper Without Transcoding

What goes wrong: The existing chat-files.ts transcription route writes raw WebM to disk and passes it to whisper-cpp without transcoding. This works accidentally when whisper-cpp was compiled with libavformat, but fails silently or returns garbage on other builds.

Why it happens: The original code relied on whisper-cpp's built-in demuxer rather than explicit format conversion.

How to avoid: Always transcode to WAV 16kHz mono via ffmpeg first, regardless of whether the input format might work natively. VPIPE-04 requires this explicitly. The pipe approach handles both WebM and OGG paths identically.

Warning signs: Transcription returns empty string or garbled text on WebM or OGG input.

Pitfall 3: voiceMode Flag Stripped at addMessage()

What goes wrong: The stream endpoint reads voiceMode from req.body and uses it for the system prompt, but then calls svc.addMessage() which only accepts { role, content, agentId, messageType }. The flag is silently dropped and never stored.

Why it happens: createMessageSchema in the shared validators does not include voiceMode, so Zod strips it.

How to avoid: Add voiceMode to createMessageSchema as an optional field. In addMessage(), store a derived value in the messageType column: "voice_full" when voiceMode is full_voice, "voice_input" when it is voice_input, and nothing for text, as in the chat.ts example under Code Examples. The DB message_type column is a free-text field with no constraints, so it can store any string value.

Warning signs: Phase 37 UI cannot determine which messages were voice messages when rendering chat history.
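
To keep the encoding in one place, the voiceMode to messageType mapping could live in a small pair of helpers. The stored values "voice_full" and "voice_input" come from this document; the helper names and the reverse lookup are assumptions for illustration.

```typescript
// Hypothetical helpers (not from the codebase) that centralise how voiceMode
// is encoded in the free-text message_type column.
export type VoiceMode = "text" | "voice_input" | "full_voice";

export function toMessageType(voiceMode?: VoiceMode): string | undefined {
  switch (voiceMode) {
    case "full_voice":
      return "voice_full";
    case "voice_input":
      return "voice_input";
    default:
      return undefined; // plain text messages leave messageType unset
  }
}

export function voiceModeFromMessageType(messageType?: string | null): VoiceMode {
  switch (messageType) {
    case "voice_full":
      return "full_voice";
    case "voice_input":
      return "voice_input";
    default:
      return "text"; // unknown or missing values are treated as text
  }
}
```

One caveat with this approach: a message that already carries another messageType would have it overwritten, so callers may want to apply the mapping only when voiceMode is set.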

Pitfall 4: Piper Process Reload Per Request

What goes wrong: Each synthesize() call spawns a new piper process. Piper loads the ONNX model fresh each time (200-800 ms overhead). Responses over ~400 characters are silently truncated.

Why it happens: Standard one-shot execFile pattern for CLI tools.

How to avoid for Phase 36: Chunk text into sentences (max ~100 chars per chunk) before calling piper. Add a warmup call at server startup. A persistent piper process architecture is deferred to Phase 39 — sentence chunking is the correct Phase 36 mitigation that keeps VPIPE-02 within 3 seconds for typical responses.

Warning signs: VPIPE-02 latency exceeds 3 seconds. Final sentence of long responses missing from audio output.
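
A minimal sketch of the sentence chunking step, using the split regex named in this document plus a word-level fallback for sentences that exceed the cap. chunkSentences is a hypothetical name, and the handling of over-long sentences is one possible choice rather than a settled design.

```typescript
// Hypothetical helper (not from the codebase): splits text into sentence-sized
// chunks so each piper call stays well below the truncation threshold.
export function chunkSentences(text: string, maxChars = 100): string[] {
  const sentences = text
    .trim()
    .split(/(?<=[.!?])\s+/) // break after sentence-ending punctuation
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  for (const sentence of sentences) {
    if (sentence.length <= maxChars) {
      chunks.push(sentence);
      continue;
    }
    // Over-long sentence: pack whole words greedily up to the cap.
    let current = "";
    for (const word of sentence.split(/\s+/)) {
      const candidate = current ? `${current} ${word}` : word;
      if (candidate.length > maxChars && current) {
        chunks.push(current);
        current = word;
      } else {
        current = candidate; // a single word longer than the cap is kept whole
      }
    }
    if (current) chunks.push(current);
  }
  return chunks;
}
```

Splitting on whitespace inside long sentences can place a pause mid-clause, which may sound slightly unnatural; splitting at commas first would be a possible refinement.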

Pitfall 5: Dual Output Markers Absent on Smaller Models

What goes wrong: The AI responds without SPOKEN: and DETAILED: section markers (~10% of calls on 7B-class models). synthesize() receives the full markdown response and speaks the markdown symbols aloud.

Why it happens: Smaller models have lower format-adherence reliability on structured output prompts.

How to avoid: Implement post-processing fallback in formatForVoice(): if the SPOKEN: marker is absent in the response, strip markdown and use the full content. The dual output prompt is Approach A; the strip fallback is Approach B. Both must be implemented — B is a required safety net, not optional.

Warning signs: TTS receives content containing literal ** emphasis markers or triple-backtick code fences.

Pitfall 6: multer Audio Upload Config Not Exported from chat-files.ts

What goes wrong: When creating voice.ts, a developer attempts to import the audioUpload multer instance from chat-files.ts. It is defined inline and not exported.

Why it happens: The multer config for audio is scoped inside chatFileRoutes().

How to avoid: Define a fresh multer instance in voice.ts. Import MAX_ATTACHMENT_BYTES from ../attachment-types.js to keep the file size limit consistent across all upload endpoints.


Code Examples

Voice Route: POST /api/transcribe and POST /api/synthesize

// server/src/routes/voice.ts
import { Router } from "express";
import multer from "multer";
import { assertBoard } from "./authz.js";
import { voicePipelineService } from "../services/voice-pipeline.js";
import { MAX_ATTACHMENT_BYTES } from "../attachment-types.js";

const audioUpload = multer({
  storage: multer.memoryStorage(),
  limits: { fileSize: MAX_ATTACHMENT_BYTES, files: 1 },
});

export function voiceRoutes(): Router {
  const router = Router();
  const svc = voicePipelineService();

  router.post("/transcribe", async (req, res) => {
    assertBoard(req);
    await new Promise<void>((resolve, reject) =>
      audioUpload.single("audio")(req, res, (err) => (err ? reject(err) : resolve()))
    );
    const file = (req as any).file as { buffer: Buffer; mimetype: string } | undefined;
    if (!file) { res.status(400).json({ error: "Missing audio field" }); return; }

    const fmt = file.mimetype.includes("ogg") ? "ogg"
      : file.mimetype.includes("wav") ? "wav"
      : "webm";

    const result = await svc.transcribe(file.buffer, fmt);
    res.json(result);
  });

  router.post("/synthesize", async (req, res) => {
    assertBoard(req);
    const { text, voiceId } = req.body as { text?: string; voiceId?: string };
    if (!text || typeof text !== "string") {
      res.status(400).json({ error: "text is required" }); return;
    }
    const audioBuffer = await svc.synthesize(text, voiceId);
    res.setHeader("Content-Type", "audio/wav");
    res.send(audioBuffer);
  });

  return router;
}

Mount in app.ts

// server/src/app.ts — add alongside other api.use() calls
import { voiceRoutes } from "./routes/voice.js";
// ...
api.use(voiceRoutes());

Remove /transcribe from chat-files.ts

Delete lines 297-386 (the inline audioUpload multer instance, runAudioUpload helper, and the router.post("/transcribe", ...) handler). The endpoint is now owned by voice.ts. No other code in chat-files.ts references these lines.

voiceMode injection in chat.ts stream endpoint

// In POST /conversations/:id/stream handler, after resolving settings/memory/puter token:
const { content, agentId, voiceMode } = req.body as {
  content: string; agentId?: string; voiceMode?: "text" | "voice_input" | "full_voice";
};

// ... (existing message building logic) ...

if (voiceMode === "full_voice") {
  messagesWithMemory.push({
    role: "system",
    content: [
      "Format your response with EXACTLY these two labeled sections:",
      "",
      "SPOKEN: [Natural speech prose only. No markdown. No bullet points. Max 2-3 sentences.]",
      "",
      "DETAILED: [Full response with all detail, code blocks, and markdown formatting.]",
    ].join("\n"),
  });
}

// ... (existing token stream loop) ...

// When persisting the assistant reply, encode voiceMode in messageType:
const message = await svc.addMessage(req.params.id!, {
  role: "assistant",
  content: fullContent.trim(),
  agentId: agentId || undefined,
  messageType: voiceMode === "full_voice" ? "voice_full"
    : voiceMode === "voice_input" ? "voice_input"
    : undefined,
});

Environment Availability

| Dependency | Required By | Available | Version | Fallback |
|------------|-------------|-----------|---------|----------|
| ffmpeg-static (npm) | VPIPE-04 | Not installed | - | Must install: pnpm add ffmpeg-static |
| ffmpeg (CLI/system) | VPIPE-04 | Not in PATH | - | Provided by ffmpeg-static package |
| whisper-cpp | VPIPE-01 | Not confirmed in PATH | - | openai-whisper Python CLI (existing cascade) |
| piper (CLI) | VPIPE-02 | Not confirmed in PATH | - | 503 response with message; non-blocking for other requirements |
| multer | VPIPE-01, VPIPE-03 | Yes (server/package.json ^2.0.2) | 2.0.2 | - |
| Node.js child_process | All | Yes (built-in) | - | - |

Missing dependencies with no fallback:

  • ffmpeg-static — must be added to server/package.json as the first task. One pnpm add ffmpeg-static command resolves this. VPIPE-04 is blocked until installed.

Missing dependencies with fallback:

  • piper CLI: not confirmed installed. When the piper binary is absent, synthesize() should fail with a descriptive error that the /synthesize route turns into an HTTP 503, the same defensive pattern as the existing whisper cascade. VPIPE-02 verification requires piper to be installed on the target machine.
  • whisper-cpp — openai-whisper Python CLI fallback already exists in the codebase and covers VPIPE-01.
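
The 503 behaviour for a missing piper binary could be implemented along these lines. This is a sketch under assumptions: runCli, CliUnavailableError and statusFor are hypothetical names. It relies on execFile's built-in timeout option, which terminates the child process when the limit expires, whereas a bare Promise.race only stops waiting for it.

```typescript
import { execFile } from "node:child_process";

// Hypothetical error type (not from the codebase) marking a CLI tool as not installed.
export class CliUnavailableError extends Error {
  readonly binary: string;
  constructor(binary: string) {
    super(`${binary} is not installed or not on PATH`);
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, CliUnavailableError.prototype);
    this.name = "CliUnavailableError";
    this.binary = binary;
  }
}

export function runCli(binary: string, args: string[], timeoutMs = 8000): Promise<string> {
  return new Promise((resolve, reject) => {
    // No shell is involved; the timeout option kills the child if it runs too long.
    execFile(binary, args, { timeout: timeoutMs }, (err, stdout) => {
      if (!err) {
        resolve(String(stdout));
        return;
      }
      // ENOENT means the binary could not be found at all.
      if ((err as { code?: unknown }).code === "ENOENT") {
        reject(new CliUnavailableError(binary));
        return;
      }
      reject(err);
    });
  });
}

// Route-side mapping: a missing tool is a 503, anything else stays a 500.
export function statusFor(err: unknown): number {
  return err instanceof CliUnavailableError ? 503 : 500;
}
```

In the /synthesize handler, a try/catch around svc.synthesize() could then call res.status(statusFor(err)) with the error message, so the other voice requirements keep working when piper is absent.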

Validation Architecture

Test Framework

| Property | Value |
|----------|-------|
| Framework | vitest ^3.0.5 |
| Config file | /opt/nexus/vitest.config.ts (environment: node) |
| Quick run command | cd /opt/nexus && pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-pipeline.test.ts |
| Full suite command | cd /opt/nexus && pnpm --filter @paperclipai/server test --run |

Phase Requirements to Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|--------------|
| VPIPE-01 | transcribe() calls whisper and returns { text, language } | unit | pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-pipeline.test.ts | No (Wave 0) |
| VPIPE-02 | synthesize() returns a Buffer; respects 8s timeout | unit | pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-pipeline.test.ts | No (Wave 0) |
| VPIPE-03 | voicePipelineService consumed by voice routes without HTTP round-trip | unit | pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-routes.test.ts | No (Wave 0) |
| VPIPE-04 | transcodeToWav16k() spawns ffmpeg with -ar 16000 -ac 1 | unit (mock spawn) | pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-pipeline.test.ts | No (Wave 0) |
| VPIPE-05 | createMessageSchema accepts and preserves voiceMode field | unit | pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-schema.test.ts | No (Wave 0) |
| VPIPE-06 | formatForVoice() strips markdown; fallback activates when SPOKEN marker absent | unit | pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-pipeline.test.ts | No (Wave 0) |

Sampling Rate

  • Per task commit: pnpm --filter @paperclipai/server test --run src/__tests__/36-*.test.ts
  • Per wave merge: pnpm --filter @paperclipai/server test --run
  • Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

  • server/src/__tests__/36-voice-pipeline.test.ts — unit tests for VPIPE-01, VPIPE-02, VPIPE-04, VPIPE-06 (mock execFile and spawn)
  • server/src/__tests__/36-voice-routes.test.ts — supertest tests for POST /api/transcribe and POST /api/synthesize (mock voicePipelineService)
  • server/src/__tests__/36-voice-schema.test.ts — Zod validator tests for voiceMode field on createMessageSchema and nexus-settings schema extension

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| fluent-ffmpeg | ffmpeg-static + spawn | May 2025 (fluent-ffmpeg archived) | Do not use fluent-ffmpeg in any new code |
| Raw WebM passed to whisper-cpp | WebM/OGG transcoded to WAV 16kHz via ffmpeg first | Phase 36 (this phase) | More reliable transcription across whisper build variants |
| Transcription logic in chat-files.ts route | Transcription in VoicePipelineService | Phase 36 (this phase) | Enables Telegram bridge to reuse STT without HTTP round-trip |

Deprecated/outdated:

  • fluent-ffmpeg: archived May 22 2025; do not install
  • @ffmpeg-installer/ffmpeg: ships FFmpeg 4.x; outdated codec support

Open Questions

  1. Does smart-whisper accept OGG or WebM directly without transcoding?

    • What we know: The existing route passes raw WebM to whisper-cpp CLI and it sometimes works because whisper-cpp was compiled with libavformat support on some builds.
    • What's unclear: Whether this is reliable across all whisper-cpp build variants present on the Mac Mini.
    • Recommendation: Make this question moot — always transcode via ffmpeg (VPIPE-04 requires this explicitly). The pipe transcode adds ~50ms and eliminates the variability entirely.
  2. Where precisely should voiceMode be stored after message persistence?

    • What we know: The chat_messages table has a message_type text column (free-text, no constraints). voiceMode is not a DB column.
    • What's unclear: Whether Phase 37 UI needs to query "was this a voice message?" from the DB, or whether the flag only matters during the live stream session.
    • Recommendation: Store voiceMode as the messageType value (e.g., "voice_full" or "voice_input") so it survives to the DB. This satisfies the "voiceMode flag survives message persistence" success criterion without a DB migration. The messageType column is already free-text.
  3. Should piperBinaryPath and whisperBinaryPath be stored in nexus-settings?

    • What we know: STATE.md notes that absolute binary paths should be stored in settings for service-mode reliability. Phase 39 onboarding will detect and configure these paths.
    • What's unclear: Whether Phase 36 should pre-provision these fields or defer entirely to Phase 39.
    • Recommendation: Add piperBinaryPath: z.string().optional() and whisperBinaryPath: z.string().optional() to nexus-settings schema in Phase 36. Default behavior: fall back to PATH lookup when the fields are absent. This preps the schema for Phase 39 onboarding without blocking Phase 36 delivery.
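
The fallback described in the third recommendation can be sketched as a small resolver. The field names piperBinaryPath and whisperBinaryPath come from the recommendation; the BinaryPathSettings interface and resolveBinary helper are hypothetical.

```typescript
// Hypothetical shape of the two optional settings fields proposed above.
export interface BinaryPathSettings {
  piperBinaryPath?: string;
  whisperBinaryPath?: string;
}

// Prefer the configured absolute path; otherwise return the bare command name
// so that execFile resolves it through PATH.
export function resolveBinary(configured: string | undefined, fallbackName: string): string {
  const trimmed = configured?.trim();
  return trimmed ? trimmed : fallbackName;
}

// Example: settings written before Phase 39 have neither field set.
const legacySettings: BinaryPathSettings = {};
export const piperBinary = resolveBinary(legacySettings.piperBinaryPath, "piper");
export const whisperBinary = resolveBinary(legacySettings.whisperBinaryPath, "whisper-cpp");
```

Treating an empty or whitespace-only string like a missing value avoids handing execFile an empty command if a settings file contains a blank field.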

Project Constraints (from CLAUDE.md)

No CLAUDE.md found in /opt/nexus. No project-level constraints beyond those captured in CONTEXT.md and STATE.md.

Inferred from codebase conventions:

  • All services use factory functions, not classes (confirmed across all server/src/services/*.ts files)
  • Use promisify(execFile) from node:child_process, never exec (confirmed: git-file-service.ts, chat-files.ts)
  • spawn with stdio array for pipe-based subprocesses (not confirmed in the codebase: there are no existing examples, but this is the standard Node.js approach)
  • Zod for all schema validation — no other validator libraries in the codebase
  • Tests in server/src/__tests__/, named NN-feature.test.ts for phase-scoped tests

Sources

Primary (HIGH confidence)

  • Direct codebase inspection: server/src/routes/chat-files.ts lines 297-386 (existing transcription pattern)
  • Direct codebase inspection: server/src/services/nexus-settings.ts — Zod schema structure and extension pattern
  • Direct codebase inspection: packages/shared/src/validators/chat.ts — createMessageSchema fields and patterns
  • Direct codebase inspection: packages/shared/src/types/chat.ts — ChatMessage interface
  • Direct codebase inspection: server/src/routes/chat.ts lines 91-193 (stream endpoint where voiceMode must be injected)
  • Direct codebase inspection: server/src/app.ts lines 164-165 (route mount pattern)
  • Direct codebase inspection: packages/db/src/schema/chat_messages.ts — confirms no voiceMode column; message_type is free-text
  • Direct codebase inspection: server/src/services/git-file-service.ts — established promisify(execFile) pattern
  • npm view ffmpeg-static version returns 5.3.0 (verified 2026-04-03)
  • npm view smart-whisper version returns 0.8.1 (verified 2026-04-03)
  • .planning/research/SUMMARY.md: project-level v1.6 research, pitfalls 27-40

Secondary (MEDIUM confidence)

  • ffmpeg-static GitHub — macOS arm64 binary confirmed, pipe invocation pattern documented
  • .planning/STATE.md — architectural decisions confirmed by project owner
  • .planning/phases/36-voice-pipeline-foundation/36-CONTEXT.md — locked implementation decisions

Tertiary (LOW confidence — patterns inferred)

  • Dual output prompt reliability on 7B models: inferred from structured output community reports; not benchmarked on the specific Hermes model in use
  • Sentence chunking split regex: industry pattern from TTS pipeline implementations; not sourced from a canonical reference

Metadata

Confidence breakdown:

  • Standard stack: HIGH — ffmpeg-static version verified via npm registry; all other libraries already in the codebase
  • Architecture: HIGH — based on direct codebase inspection; all integration points confirmed by reading actual source files
  • Pitfalls: HIGH — derived from both codebase analysis and project-level research in SUMMARY.md
  • Test map: MEDIUM — test file names follow the established NN-feature.test.ts naming pattern in __tests__/; test content is inferred from requirements, not existing test files

Research date: 2026-04-03
Valid until: 2026-05-03 (stable domain: ffmpeg-static and whisper APIs change slowly)