Nexus Dev d9c628fe38 docs(34): research phase voice domain

2026-04-03 22:23:50 +00:00


Phase 34: Voice - Research

Researched: 2026-04-01
Domain: Server-side STT (Whisper CLI via the existing /transcribe route; smart-whisper evaluated), browser TTS (Piper via @mintplex-labs/piper-tts-web WASM), onboarding voice step
Confidence: MEDIUM

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

None — all implementation choices are at Claude's discretion.

Claude's Discretion

All implementation choices are at Claude's discretion.

Deferred Ideas (OUT OF SCOPE)

None. </user_constraints>

<phase_requirements>

Phase Requirements

ID	Description	Research Support
VOICE-01	User gets Piper TTS speech output that works on CPU-only hardware	@mintplex-labs/piper-tts-web runs entirely in browser WASM via ONNX Runtime — no GPU needed
VOICE-02	Piper TTS pre-warms on first use with visible download progress (no silent 15-30s hang)	`tts.download(voiceId, progressCallback)` API provides loaded/total bytes; render a progress bar before calling `predict()`
VOICE-03	Voice features (Whisper STT + Piper TTS) offered during onboarding based on hardware capability	NexusOnboardingWizard currently has 5 steps; insert voice as the new step 4, shown once `hardwareInfo.hardwareTier !== undefined`. No tier is excluded: Piper TTS is CPU-bound browser WASM, so the effective gate is browser capability (mic and MediaRecorder support), not hardware tier
</phase_requirements>

Summary

Phase 34 adds two voice capabilities: speech-to-text (STT) via Whisper, and text-to-speech (TTS) via Piper, plus an onboarding step where users can opt into voice features.

The STT side already has a server route (POST /api/transcribe in chat-files.ts) and a VoiceRecordButton component that calls it. The route is implemented correctly but has a critical gap: it is exported from routes/index.ts but never registered in app.ts, so POST /api/transcribe returns 404 at runtime. Fixing this registration is the primary STT task.

For TTS, the project currently has zero Piper integration. The recommended approach is browser-side WASM via @mintplex-labs/piper-tts-web (v1.0.4, MIT). This library wraps the Piper ONNX models in WebAssembly so synthesis runs on-device without a server round-trip, satisfying VOICE-01 (CPU-only hardware). The key UX concern (VOICE-02) is a 10-50 MB model download that blocks first synthesis — the library provides a download() method with a progress callback that must be wired to a visible UI element before calling predict().

The onboarding voice step (VOICE-03) should be inserted into NexusOnboardingWizard.tsx as step 4 (shifting the existing "root directory" step to 5 and "summary" to 6). The step should probe mic permission availability and detect whether the browser supports MediaRecorder to inform the user, then offer a "yes, enable voice" / "skip" choice. Since all hardware tiers can run browser-WASM TTS, the gate is not tier-based — it is browser-capability-based.

Primary recommendation: Register chatFileRoutes in app.ts to fix STT; add @mintplex-labs/piper-tts-web for browser-side TTS with a progress-bar pre-warm flow; add a voice opt-in step in NexusOnboardingWizard.

Standard Stack

Core

Library	Version	Purpose	Why Standard
`@mintplex-labs/piper-tts-web`	1.0.4	Browser-side Piper TTS via WASM/ONNX	Browser-only, no server infra, CPU-safe, actively maintained fork used in AnythingLLM
`smart-whisper`	0.8.1	Native Node.js Whisper.cpp binding for STT	Auto-downloads models, Metal on Apple Silicon, used as drop-in replacement for CLI approach

Supporting

Library	Version	Purpose	When to Use
`node-wav`	0.0.2	Decode WAV buffer to Float32Array for smart-whisper	Required: smart-whisper only accepts 16kHz Float32Array PCM, not raw webm

Note on audio conversion: The browser sends audio/webm;codecs=opus, while smart-whisper requires 16 kHz mono Float32Array PCM. The existing /transcribe route writes a temp .webm file and calls the whisper or whisper-cpp CLI, which works when one of those CLIs is installed. Upgrading to smart-whisper would add a conversion step, and ffmpeg is not installed on this host. The options are therefore: (a) require ffmpeg as an install-time dependency (fluent-ffmpeg plus a system ffmpeg), or (b) keep the CLI-fallback pattern in the existing route and only fix the route registration instead of rewriting the transcription logic. Option (b) is lower risk.
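To make the "16 kHz mono Float32Array PCM" requirement concrete, here is a minimal sketch of the last conversion stage only. It assumes the audio has already been decoded to interleaved 16-bit PCM at 16 kHz by some other tool (for example ffmpeg under option (a)); it does not decode webm/opus and does not resample. The function name `int16ToMonoFloat32` is hypothetical and not part of the codebase or of smart-whisper.

```typescript
// Hypothetical helper (not in the codebase): turns interleaved 16-bit PCM
// into mono Float32 samples in the range [-1, 1).
function int16ToMonoFloat32(interleaved: Int16Array, channels: number): Float32Array {
  if (!Number.isInteger(channels) || channels < 1) {
    throw new RangeError("channels must be a positive integer");
  }
  const frames = Math.floor(interleaved.length / channels);
  const mono = new Float32Array(frames);
  for (let frame = 0; frame < frames; frame++) {
    let sum = 0;
    for (let ch = 0; ch < channels; ch++) {
      sum += interleaved[frame * channels + ch];
    }
    // Average the channels, then divide by 2^15 so that -32768 maps to -1.
    mono[frame] = sum / channels / 32768;
  }
  return mono;
}
```

Even with this stage in place, the webm-to-PCM decode is still missing on this host, which is why option (b) remains the lower-risk path for this phase.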

Alternatives Considered

Instead of	Could Use	Tradeoff
`@mintplex-labs/piper-tts-web` (browser)	`piper` Python CLI via server route	CLI requires Python + model install; adds server complexity; VOICE-F01 deferred to future
Fix route registration	Rewrite transcription with `smart-whisper`	smart-whisper requires PCM conversion (no ffmpeg on host); high risk; the existing CLI fallback is simpler

Installation (UI):

pnpm --filter @paperclipai/ui add @mintplex-labs/piper-tts-web

No new server deps needed for the minimal fix (just registering existing route). If upgrading to smart-whisper in a future phase:

pnpm --filter @paperclipai/server add smart-whisper node-wav

Version verification (confirmed against npm registry 2026-04-01):

@mintplex-labs/piper-tts-web: 1.0.4 (latest)
smart-whisper: 0.8.1 (latest)
node-wav: 0.0.2 (latest)

Architecture Patterns

Recommended Project Structure

server/src/
├── app.ts                    # ADD: chatFileRoutes registration (one import + one api.use() call)
└── routes/
    ├── chat-files.ts         # Existing /transcribe route, no changes needed
    └── voice.ts              # Optional: extract a dedicated voice route if /synthesize is added

ui/src/
├── components/
│   ├── NexusOnboardingWizard.tsx  # MODIFY: insert step 4 (voice), shift steps 4→5, 5→6
│   ├── VoiceRecordButton.tsx      # Existing, no changes needed once the server route is mounted
│   ├── TtsButton.tsx              # NEW: speaker icon button that calls piper-tts-web predict()
│   └── onboarding/
│       └── VoiceStep.tsx          # NEW: opt-in step for voice features
└── hooks/
    └── usePiperTts.ts             # NEW: singleton TtsSession, download(), predict(), status

Pattern 1: Route Registration Fix (STT)

What: chatFileRoutes is defined and exported but never registered in app.ts. Add one import and one api.use() call.

When to use: This is the only required change for STT to function.

Example:

// server/src/app.ts — add after line ~31 (other imports)
import { chatFileRoutes } from "./routes/chat-files.js";

// ...inside createApp, after api.use(assistantHandoffRoutes(db)):
api.use(chatFileRoutes(db, opts.storageService));

The chatFileRoutes function signature: chatFileRoutes(db: Db, storage: StorageService). In app.ts, opts.storageService is the storage argument.

Pattern 2: Piper TTS Hook (Browser-Side WASM)

What: A React hook wrapping @mintplex-labs/piper-tts-web that manages model download state and synthesis. The model download is the pre-warm step that prevents the silent 15-30s hang on first synthesis.

When to use: Any component that needs to read assistant responses aloud.

Example:

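// ui/src/hooks/usePiperTts.ts
import { useState } from "react";
// The package README imports the module namespace rather than a named `tts` export.
import * as tts from "@mintplex-labs/piper-tts-web";

const DEFAULT_VOICE = "en_US-hfc_female-medium";

type TtsStatus = "idle" | "downloading" | "ready" | "speaking";

export function usePiperTts() {
  const [status, setStatus] = useState<TtsStatus>("idle");
  const [progress, setProgress] = useState(0); // 0–100

  // Pre-warm: fetch the voice model once and report progress to the UI.
  async function prewarm() {
    try {
      const stored = await tts.stored();
      if (!stored.includes(DEFAULT_VOICE)) {
        setStatus("downloading");
        await tts.download(DEFAULT_VOICE, (p) => {
          // Skip the update when the total size is unknown (avoids NaN).
          if (p.total > 0) setProgress(Math.round((p.loaded / p.total) * 100));
        });
      }
      setProgress(100);
      setStatus("ready");
    } catch (err) {
      setStatus("idle"); // allow a retry instead of staying stuck in "downloading"
      throw err;
    }
  }

  async function speak(text: string) {
    if (status !== "ready") return;
    setStatus("speaking");
    try {
      // Per the README, predict() resolves to a WAV Blob, so wrap it in an object URL.
      const wav = await tts.predict({ text, voiceId: DEFAULT_VOICE });
      const url = URL.createObjectURL(wav);
      const audio = new Audio(url);
      audio.onended = () => {
        URL.revokeObjectURL(url); // release the Blob once playback finishes
        setStatus("ready");
      };
      await audio.play();
    } catch (err) {
      setStatus("ready");
      throw err;
    }
  }

  return { status, progress, prewarm, speak };
}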

Key points:

Per the package README, tts.predict() resolves to a WAV Blob. Create an object URL with URL.createObjectURL(wav) and play it with new Audio(url).play(); this is the simplest approach and needs no Web Audio API.
tts.stored() checks IndexedDB cache; download is skipped if model already present.
The library is browser-only. Do not import in server code.
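The hook's status values form a small state machine, and writing the transitions as a pure function makes them testable without loading the WASM library. The sketch below is one possible shape, not the hook's actual implementation: the event names and the `nextTtsStatus` function are hypothetical, and the "failed download returns to idle" rule is an assumption added for illustration.

```typescript
type TtsStatus = "idle" | "downloading" | "ready" | "speaking";

// Hypothetical event names, chosen for this sketch only.
type TtsEvent =
  | { type: "MODEL_CACHED" } // the voice was already stored, no download needed
  | { type: "DOWNLOAD_STARTED" }
  | { type: "DOWNLOAD_FINISHED" }
  | { type: "DOWNLOAD_FAILED" }
  | { type: "SPEAK_STARTED" }
  | { type: "SPEAK_ENDED" };

// Returns the next status, or the current one when the event is not valid in that state.
function nextTtsStatus(current: TtsStatus, event: TtsEvent): TtsStatus {
  switch (event.type) {
    case "MODEL_CACHED":
      return current === "idle" ? "ready" : current;
    case "DOWNLOAD_STARTED":
      return current === "idle" ? "downloading" : current;
    case "DOWNLOAD_FINISHED":
      return current === "downloading" ? "ready" : current;
    case "DOWNLOAD_FAILED":
      return current === "downloading" ? "idle" : current;
    case "SPEAK_STARTED":
      // Speaking is only allowed once the model is ready (the pre-warm rule).
      return current === "ready" ? "speaking" : current;
    case "SPEAK_ENDED":
      return current === "speaking" ? "ready" : current;
  }
}
```

A reducer like this could back the hook via useReducer, and it gives the VOICE-02 unit test something to assert against while the piper-tts-web module itself is mocked.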

Pattern 3: Onboarding Voice Step

What: Add a step 4 in NexusOnboardingWizard.tsx that shows STT+TTS capability, checks mic permission, and lets users opt in. Because piper-tts-web is CPU-safe WASM, the gate is browser capability (navigator.mediaDevices), not hardware tier.

When to use: VOICE-03 requirement — offer voice during onboarding.

Step numbering shift:

Current: 1=hardware, 2=mode, 3=provider, 4=rootDir, 5=summary
New: 1=hardware, 2=mode, 3=provider, 4=voice, 5=rootDir, 6=summary
Update Step type from 1 | 2 | 3 | 4 | 5 to 1 | 2 | 3 | 4 | 5 | 6
Update the "Step X of 5" label to "Step X of 6" (or keep it at 5 if the summary screen is not counted as a step)
Update all setStep() calls to use new numbers
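One way to make this renumbering less error-prone is to derive step numbers from an ordered list of step names instead of hard-coding them. The sketch below is illustrative only: the step names come from the list above, but `STEP_ORDER` and the helper functions are hypothetical and do not exist in NexusOnboardingWizard.tsx.

```typescript
// Step names in display order; inserting "voice" here is the only renumbering needed.
const STEP_ORDER = ["hardware", "mode", "provider", "voice", "rootDir", "summary"] as const;
type StepName = (typeof STEP_ORDER)[number];

// 1-based step number for labels.
function stepNumber(name: StepName): number {
  return STEP_ORDER.indexOf(name) + 1;
}

// Continue button target; stays on the last step at the end.
function nextStep(name: StepName): StepName {
  const i = STEP_ORDER.indexOf(name);
  return STEP_ORDER[Math.min(i + 1, STEP_ORDER.length - 1)];
}

// Back button target; stays on the first step at the start.
function previousStep(name: StepName): StepName {
  const i = STEP_ORDER.indexOf(name);
  return STEP_ORDER[Math.max(i - 1, 0)];
}

// Label text such as "Step 4 of 6".
function stepLabel(name: StepName): string {
  return `Step ${stepNumber(name)} of ${STEP_ORDER.length}`;
}
```

With this shape, Back and Continue handlers call previousStep and nextStep, so a future step insertion touches one array rather than every setStep(N) call. Whether that refactor is worth doing in this phase, versus a careful manual audit, is a planner decision.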

Voice opt-in state to track:

const [voiceEnabled, setVoiceEnabled] = useState(false);

Persist the choice in nexus-settings.json via a new field (e.g., voiceEnabled: boolean). That file is file-backed JSON rather than a database table, so adding the field stays within the "no DB schema changes" constraint. localStorage is a per-browser fallback if the settings schema must not change.
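If the field is added, settings files written before this phase will not contain it, so the reader needs a default. The following dependency-free sketch shows that behavior under stated assumptions: `parseNexusSettings` is a hypothetical name, and typing `mode` as a plain string is a guess, since the real type is defined by the project's settings schema.

```typescript
interface NexusSettings {
  mode: string; // assumption: the real type comes from the project's settings schema
  voiceEnabled: boolean;
}

// Hypothetical reader for the contents of nexus-settings.json.
function parseNexusSettings(json: string): NexusSettings {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null) {
    throw new TypeError("nexus-settings.json must contain a JSON object");
  }
  const record = raw as Record<string, unknown>;
  const mode = record["mode"];
  const voiceEnabled = record["voiceEnabled"];
  if (typeof mode !== "string") {
    throw new TypeError("nexus-settings.json is missing a string `mode`");
  }
  return {
    mode,
    // Older files have no voiceEnabled field: treat voice as not opted in.
    voiceEnabled: typeof voiceEnabled === "boolean" ? voiceEnabled : false,
  };
}
```

Defaulting to false keeps existing installs unchanged until the user explicitly opts in during onboarding.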

Example VoiceStep component structure:

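// ui/src/components/onboarding/VoiceStep.tsx
import { useEffect, useState } from "react";
// Button comes from the project's existing UI components (import path not shown here).

interface VoiceStepProps {
  onEnable: () => void;
  onSkip: () => void;
}

export function VoiceStep({ onEnable, onSkip }: VoiceStepProps) {
  // null = probe still running, true/false = result
  const [micAvailable, setMicAvailable] = useState<boolean | null>(null);

  useEffect(() => {
    // Non-blocking probe: is an audio input device present?
    const mediaDevices = navigator.mediaDevices;
    if (!mediaDevices?.enumerateDevices) {
      // Without this branch an optional chain would short-circuit and leave the state at null.
      setMicAvailable(false);
      return;
    }
    mediaDevices.enumerateDevices()
      .then(devices => setMicAvailable(devices.some(d => d.kind === "audioinput")))
      .catch(() => setMicAvailable(false));
  }, []);

  return (
    <>
      <h1>Voice features</h1>
      <p>Speak to your assistant (Whisper STT) and hear responses read aloud (Piper TTS). Runs entirely on your device.</p>
      {micAvailable === false && (
        <p className="text-muted-foreground text-sm">No microphone detected. STT is unavailable, but TTS still works.</p>
      )}
      <Button onClick={onEnable}>Enable voice</Button>
      <Button variant="ghost" onClick={onSkip}>Skip</Button>
    </>
  );
}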
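The capability check in the step above can also be factored into a pure function, which keeps the browser probing at the edge and makes the decision logic unit-testable. This is a sketch under assumptions: `sttUnavailableReason` and `DeviceLike` are hypothetical names, and the message strings are placeholders rather than existing UI copy.

```typescript
// Minimal shape of the entries returned by enumerateDevices() that this check needs.
interface DeviceLike {
  kind: string;
}

// Returns null when STT input looks usable, otherwise a short reason to show the user.
function sttUnavailableReason(
  devices: readonly DeviceLike[] | null, // null: navigator.mediaDevices is not available
  hasMediaRecorder: boolean,
): string | null {
  if (devices === null) return "This browser does not expose navigator.mediaDevices.";
  if (!hasMediaRecorder) return "This browser does not support MediaRecorder.";
  if (!devices.some((d) => d.kind === "audioinput")) return "No microphone detected.";
  return null;
}
```

In the component, the inputs would come from navigator.mediaDevices.enumerateDevices() and a `typeof MediaRecorder !== "undefined"` check. TTS is deliberately left out of this function because it does not depend on a microphone.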

Anti-Patterns to Avoid

Importing piper-tts-web in Node.js: The library explicitly does not support Node.js. It must only be imported in browser code (UI package). Vite will not include it in the server bundle.
Calling tts.predict() before downloading the model: Results in a 15-30s silent hang. Always call tts.download() first (or check tts.stored()), show progress, then call predict().
Registering /transcribe before auth middleware: The existing /transcribe route calls assertBoard(req) — it must sit inside the api sub-router (after boardMutationGuard), not before it. The chatFileRoutes call belongs at line ~161 of app.ts alongside other api.use() calls.
Passing the predict() result straight to new Audio(): per the package README, tts.predict() resolves to a WAV Blob, not a URL string or a Buffer. Convert it with URL.createObjectURL(wav) and pass the resulting URL to new Audio(url).

Don't Hand-Roll

Problem	Don't Build	Use Instead	Why
TTS synthesis in browser	Custom ONNX loader + Piper WASM integration	`@mintplex-labs/piper-tts-web`	Already bundles ort-wasm, the phonemizer, and model management in a 492 KB package
Model download progress	Manual fetch with XHR progress	`tts.download(voiceId, progressCb)`	Built-in progress callback, automatic IndexedDB caching
PCM audio decoding for Whisper	Custom webm→PCM decoder	Keep CLI fallback in existing `/transcribe` route	No ffmpeg on host; smart-whisper requires PCM — adding conversion is out-of-scope for this phase
Mic availability detection	Custom navigator probe	`navigator.mediaDevices.enumerateDevices()`	Native browser API, no library needed

Key insight: The browser handles TTS completely — no server-side Piper install needed for VOICE-01/02. Server-side Piper (VOICE-F01) is explicitly deferred.

Common Pitfalls

Pitfall 1: `/transcribe` Returns 404

What goes wrong: VoiceRecordButton sends audio to /api/transcribe, gets a 404, swallows the error silently. Voice input appears broken with no feedback to the user.

Why it happens: chatFileRoutes is exported in routes/index.ts but not imported or registered in app.ts. The route exists in code but is never mounted.

How to avoid: Add import { chatFileRoutes } from "./routes/chat-files.js" and api.use(chatFileRoutes(db, opts.storageService)) in app.ts.

Warning signs: POST /api/transcribe returns 404; no logs from the route handler; the VoiceRecordButton spinner appears and disappears with no text inserted.
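Because the client currently swallows the failure, it can help to map the response status to a cause explicitly. The sketch below encodes only the two failure modes described in this document (404 when the route is not mounted, 503 when no Whisper CLI is installed); the function and type names are hypothetical.

```typescript
type TranscribeDiagnosis = "ok" | "route-not-mounted" | "whisper-missing" | "other";

// Classifies the HTTP status of a POST /api/transcribe response.
function diagnoseTranscribeStatus(httpStatus: number): TranscribeDiagnosis {
  if (httpStatus >= 200 && httpStatus < 300) return "ok";
  if (httpStatus === 404) return "route-not-mounted"; // chatFileRoutes not registered in app.ts
  if (httpStatus === 503) return "whisper-missing"; // route reached, but no Whisper CLI found
  return "other"; // auth failures, server errors, etc. are not distinguished here
}
```

After the registration fix, a 503 rather than a 404 on a host without Whisper is the expected result, and it confirms the route is mounted.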

Pitfall 2: Piper TTS Silent Hang on First Use

What goes wrong: User clicks "speak" button. Nothing happens for 15-30 seconds, then audio plays. User thinks it's broken.

Why it happens: tts.predict() internally downloads the ONNX model (~10-50MB) on first call with no progress feedback.

How to avoid: Call tts.download(voiceId, progressCb) explicitly before the first predict(). Show a progress bar or spinner with percentage. The prewarm() pattern in the hook above is the canonical fix for VOICE-02.

Warning signs: The first TTS invocation is slow (15-30 s) while subsequent calls are fast; the model shows up under IndexedDB in browser DevTools only after the first successful call.
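The percentage shown during pre-warm is worth computing defensively, since a progress event with an unknown or zero total would otherwise produce NaN or Infinity in the progress bar. A minimal sketch follows; `downloadPercent` is a hypothetical helper, and returning null for an unknown size is a design choice made for illustration so the UI can fall back to an indeterminate spinner.

```typescript
// Converts loaded/total byte counts into a whole percentage in 0..100,
// or null when the total size is unknown.
function downloadPercent(loaded: number, total: number): number | null {
  if (!Number.isFinite(loaded) || !Number.isFinite(total) || total <= 0) {
    return null; // size unknown: show an indeterminate spinner instead of a bar
  }
  const pct = Math.round((loaded / total) * 100);
  return Math.min(100, Math.max(0, pct)); // clamp in case loaded overshoots total
}
```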

Pitfall 3: `chatFileRoutes` Argument Mismatch

What goes wrong: Passing incorrect arguments to chatFileRoutes(db, storage) — e.g., passing the wrong storage interface type.

Why it happens: app.ts uses opts.storageService which is a StorageService. The function signature is chatFileRoutes(db: Db, storage: StorageService).

How to avoid: Verify the StorageService import path and type. In app.ts, opts.storageService is already typed as StorageService and is used by other routes (e.g., assetRoutes(db, opts.storageService)). Mirror that pattern exactly.

Pitfall 4: Onboarding Step Counter Mismatch

What goes wrong: Adding step 4 (voice) but forgetting to update Back/Continue setStep() calls in steps 5 and 6, causing step transitions to skip or loop.

Why it happens: NexusOnboardingWizard.tsx has hard-coded step numbers throughout (setStep(4), setStep(5), etc.) and a Step type union (1 | 2 | 3 | 4 | 5).

How to avoid: When inserting step 4, do a full audit of all setStep(N) calls and the Step type. The type Step = 1 | 2 | 3 | 4 | 5 must become 1 | 2 | 3 | 4 | 5 | 6. All old setStep(4) → setStep(5), setStep(5) → setStep(6).

Pitfall 5: piper-tts-web In a Web Worker Context

What goes wrong: Importing @mintplex-labs/piper-tts-web fails in a Web Worker because of missing window or document globals.

Why it happens: The library expects main-thread browser globals. Its README does describe a Web Worker pattern, but that route requires careful WASM path configuration.

How to avoid: Use the library from a regular React component/hook (main thread). Do not import in server-side code, Node.js workers, or Vitest Node environment tests. Mark test files importing it with @vitest-environment jsdom if needed, or mock the module in tests.

Code Examples

Fix: Register chatFileRoutes in app.ts

// Source: server/src/app.ts (existing pattern — mirror assetRoutes)

// Near top of file with other route imports:
import { chatFileRoutes } from "./routes/chat-files.js";

// Inside createApp(), after api.use(assistantHandoffRoutes(db)):
api.use(chatFileRoutes(db, opts.storageService));

Confirmed pattern from app.ts line 147:

api.use(assetRoutes(db, opts.storageService));

TTS: Minimal predict() call

// Source: @mintplex-labs/piper-tts-web README

import * as tts from "@mintplex-labs/piper-tts-web";

// Download model with progress (pre-warm):
await tts.download("en_US-hfc_female-medium", (progress) => {
  const pct = Math.round((progress.loaded / progress.total) * 100);
  console.log(`Downloading voice model: ${pct}%`);
});

// Synthesize (resolves to a WAV Blob):
const wav = await tts.predict({
  text: "Hello, I am your assistant.",
  voiceId: "en_US-hfc_female-medium",
});

// Play: wrap the Blob in an object URL first.
const audio = new Audio(URL.createObjectURL(wav));
await audio.play();

TTS: Check if already downloaded (skip re-download)

// Source: @mintplex-labs/piper-tts-web README

const stored = await tts.stored(); // string[] of cached voiceIds
if (!stored.includes("en_US-hfc_female-medium")) {
  await tts.download("en_US-hfc_female-medium", progressCb);
}

Mic availability probe (no library required)

// Source: MDN Web API (browser standard)

async function hasMicrophone(): Promise<boolean> {
  try {
    const devices = await navigator.mediaDevices.enumerateDevices();
    return devices.some((d) => d.kind === "audioinput");
  } catch {
    return false;
  }
}

State of the Art

Old Approach	Current Approach	When Changed	Impact
whisper CLI (openai-whisper Python)	smart-whisper Node.js native binding	2023-2024	No Python runtime needed; better perf
Piper CLI binary	@mintplex-labs/piper-tts-web WASM	2024	Runs in browser, no server setup
Server-rendered TTS audio	Client-side WASM synthesis	2024	Eliminates network round-trip; offline-safe

Deprecated/outdated:

whisper-cpp CLI: still works but requires system-level install; the existing /transcribe route already has this fallback — adequate for now
rhasspy/piper repository: archived Oct 2025, development moved to OHF-Voice/piper1-gpl; the @mintplex-labs/piper-tts-web npm package uses the original archived models (MIT) and still works

Open Questions

nexus-settings voice persistence
- What we know: nexus-settings.json currently only stores { mode }. The nexusSettingsSchema is a Zod schema.
- What's unclear: Should voiceEnabled: boolean be added to the schema? The constraint says "no DB schema changes" but this is a file-backed JSON, not a DB table.
- Recommendation: Add voiceEnabled: z.boolean().default(false) to nexusSettingsSchema. This is a file field, not a DB migration. The planner should confirm this is acceptable under the "no DB schema changes" constraint.
smart-whisper Apple Silicon unverified claim (from STATE.md blockers)
- What we know: STATE.md notes "smart-whisper Apple Silicon acceleration claim unverified on Mac Mini M4 — fall back to tiny.en if base.en acceleration not confirmed on device."
- What's unclear: Whether Metal acceleration actually works for base.en on M4.
- Recommendation: The current /transcribe route uses CLI fallback anyway. Since this phase is NOT rewriting STT with smart-whisper (just fixing route registration), this blocker does not apply to Phase 34.
VoiceRecordButton in PersonalAssistant
- What we know: ChatPanel sets enableVoiceInput={true}. PersonalAssistant.tsx does not use ChatInput and has its own send form that does NOT include a VoiceRecordButton. Voice input only works in the project-mode ChatPanel, not in the personal assistant chat.
- What's unclear: Whether VOICE-01/02/03 require voice in personal assistant chat specifically.
- Recommendation: Planner should add VoiceRecordButton to PersonalAssistant.tsx's input area as part of this phase, since personal assistant is the primary chat surface for v1.5.

Environment Availability

Dependency	Required By	Available	Version	Fallback
Node.js	Server runtime	Yes	v20.20.2	—
piper CLI	VOICE-01 (server-side)	No	—	Browser WASM via piper-tts-web (preferred)
whisper CLI	/transcribe route	No	—	Route returns 503 with user-visible error
whisper-cpp CLI	/transcribe route	No	—	Falls through to openai-whisper, then 503
ffmpeg	WebM→PCM conversion	No	—	Keep CLI-fallback STT; no smart-whisper upgrade this phase
Browser MediaRecorder	VoiceRecordButton	N/A (browser)	—	Degrades gracefully (mic unavailable state)

Missing dependencies with no fallback:

None that block this phase — the /transcribe route already handles missing Whisper CLIs gracefully with a 503 + descriptive error. Piper TTS runs entirely in browser WASM, no server dep.

Missing dependencies with fallback:

whisper / whisper-cpp: Not installed. The route returns { error: "Whisper not available. Install whisper-cpp or openai-whisper for voice input." } with a 503, which is existing behavior. STT stays unavailable until the user installs Whisper; that is acceptable provided the 503 message is surfaced to the user, since it tells them what to install.

Validation Architecture

Test Framework

Property	Value
Framework	Vitest 3.0.5
Config file	`server/vitest.config.ts`
Quick run command	`npx vitest run server/src/__tests__/34-voice-routes.test.ts`
Full suite command	`npx vitest run` (from `/opt/nexus`)

Phase Requirements → Test Map

Req ID	Behavior	Test Type	Automated Command	File Exists?
VOICE-01	`chatFileRoutes` is mounted in `app.ts` — GET/POST routes are reachable	unit (route)	`npx vitest run server/src/__tests__/34-voice-routes.test.ts`	No — Wave 0
VOICE-02	`usePiperTts` hook exposes `prewarm()`, `status`, `progress`	unit (hook)	`npx vitest run ui/src/hooks/usePiperTts.test.ts`	No — Wave 0
VOICE-03	`NexusOnboardingWizard` renders voice step at step 4	unit (component)	Manual / `npx vitest run ui/src/components/NexusOnboardingWizard.test.ts`	No — Wave 0

Sampling Rate

Per task commit: npx vitest run server/src/__tests__/34-voice-routes.test.ts
Per wave merge: npx vitest run
Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

server/src/__tests__/34-voice-routes.test.ts — covers VOICE-01 (route registration, 503 when no whisper CLI)
ui/src/hooks/usePiperTts.test.ts — covers VOICE-02 hook state machine (mock piper-tts-web)
ui/src/components/onboarding/VoiceStep.test.tsx — covers VOICE-03 step rendering

Sources

Primary (HIGH confidence)

Codebase inspection — server/src/routes/chat-files.ts, server/src/app.ts, ui/src/components/VoiceRecordButton.tsx, ui/src/components/NexusOnboardingWizard.tsx
npm registry — @mintplex-labs/piper-tts-web@1.0.4, smart-whisper@0.8.1 (verified 2026-04-01)

Secondary (MEDIUM confidence)

Mintplex-Labs/piper-tts-web README — tts.download(), tts.predict(), tts.stored() API
JacobLinCool/smart-whisper GitHub — Whisper class, PCM Float32Array requirement, Metal on Apple Silicon
smart-whisper documentation — transcribe API, model manager

Tertiary (LOW confidence)

WebSearch results for Piper TTS Node.js integration — browser-only WASM pattern confirmed by multiple sources

Metadata

Confidence breakdown:

Standard stack: MEDIUM — npm package versions verified; API verified via README; no local test environment to run the library
Architecture: HIGH — based on direct codebase inspection (route registration gap confirmed, wizard step structure confirmed)
Pitfalls: HIGH — route registration gap is a confirmed code-level fact, not speculation

Research date: 2026-04-01
Valid until: 2026-05-01 (stable libraries; piper-tts-web and smart-whisper are low-churn)
