Phase 34: Voice - Research
Researched: 2026-04-01
Domain: Browser STT (Whisper via smart-whisper), Browser TTS (Piper via @mintplex-labs/piper-tts-web WASM), Onboarding voice step
Confidence: MEDIUM
<user_constraints>
User Constraints (from CONTEXT.md)
Locked Decisions
None — all implementation choices are at Claude's discretion.
Claude's Discretion
All implementation choices are at Claude's discretion.
Deferred Ideas (OUT OF SCOPE)
None. </user_constraints>
<phase_requirements>
Phase Requirements
| ID | Description | Research Support |
|---|---|---|
| VOICE-01 | User gets Piper TTS speech output that works on CPU-only hardware | @mintplex-labs/piper-tts-web runs entirely in browser WASM via ONNX Runtime — no GPU needed |
| VOICE-02 | Piper TTS pre-warms on first use with visible download progress (no silent 15-30s hang) | tts.download(voiceId, progressCallback) API provides loaded/total bytes; render a progress bar before calling predict() |
| VOICE-03 | Voice features (Whisper STT + Piper TTS) offered during onboarding based on hardware capability | NexusOnboardingWizard currently has 5 steps; add a step 4 (voice) gated on hardwareInfo.hardwareTier !== undefined; all tiers can run voice since it is purely CPU-bound WASM |
</phase_requirements>
Summary
Phase 34 adds two voice capabilities: speech-to-text (STT) via Whisper, and text-to-speech (TTS) via Piper, plus an onboarding step where users can opt into voice features.
The STT side already has a server route (POST /api/transcribe in chat-files.ts) and a VoiceRecordButton component that calls it. The route is implemented correctly but has a critical gap: it is exported from routes/index.ts but never registered in app.ts, so POST /api/transcribe returns 404 at runtime. Fixing this registration is the primary STT task.
For TTS, the project currently has zero Piper integration. The recommended approach is browser-side WASM via @mintplex-labs/piper-tts-web (v1.0.4, MIT). This library wraps the Piper ONNX models in WebAssembly so synthesis runs on-device without a server round-trip, satisfying VOICE-01 (CPU-only hardware). The key UX concern (VOICE-02) is a 10-50 MB model download that blocks first synthesis — the library provides a download() method with a progress callback that must be wired to a visible UI element before calling predict().
The onboarding voice step (VOICE-03) should be inserted into NexusOnboardingWizard.tsx as step 4 (shifting the existing "root directory" step to 5 and "summary" to 6). The step should probe mic permission availability and detect whether the browser supports MediaRecorder to inform the user, then offer a "yes, enable voice" / "skip" choice. Since all hardware tiers can run browser-WASM TTS, the gate is not tier-based — it is browser-capability-based.
Primary recommendation: Register chatFileRoutes in app.ts to fix STT; add @mintplex-labs/piper-tts-web for browser-side TTS with a progress-bar pre-warm flow; add a voice opt-in step in NexusOnboardingWizard.
Standard Stack
Core
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| @mintplex-labs/piper-tts-web | 1.0.4 | Browser-side Piper TTS via WASM/ONNX | Browser-only, no server infra, CPU-safe, actively maintained fork used in AnythingLLM |
| smart-whisper | 0.8.1 | Native Node.js Whisper.cpp binding for STT | Auto-downloads models, Metal on Apple Silicon, used as drop-in replacement for CLI approach |
Supporting
| Library | Version | Purpose | When to Use |
|---|---|---|---|
| node-wav | 0.0.2 | Decode WAV buffer to Float32Array for smart-whisper | Required: smart-whisper only accepts 16kHz Float32Array PCM, not raw webm |
Note on audio conversion: The browser sends audio/webm;codecs=opus, while smart-whisper requires 16kHz mono Float32Array PCM. The existing /transcribe route writes a temp .webm file and calls the whisper or whisper-cpp CLI, which works when those CLIs are installed. Upgrading to smart-whisper would require a conversion step, and ffmpeg is not present on this machine. The options are therefore: (a) require ffmpeg as an install-time dep via fluent-ffmpeg + system ffmpeg, or (b) keep the CLI-fallback pattern in the existing route and just fix the route registration rather than rewriting the transcription logic. Option (b) is lower risk.
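If the smart-whisper upgrade is picked up in a later phase, the decoded audio still has to be reduced to 16kHz mono before it can be handed over. The sketch below covers only that last step, assuming the audio has already been decoded into one Float32Array per channel; it does not decode webm/opus, which would still need ffmpeg or an equivalent. The helper names (downmixToMono, resampleLinear, toWhisperPcm) are hypothetical, and the resampler is a naive linear interpolation without an anti-aliasing filter.

```typescript
/** Average all channels into a single mono channel (channels assumed equal length). */
export function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

/** Naive linear-interpolation resampler; no low-pass filtering is applied. */
export function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number,
): Float32Array {
  if (fromRate === toRate || input.length === 0) return input.slice();
  const outLength = Math.max(1, Math.round((input.length * toRate) / fromRate));
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    out[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return out;
}

/** Hypothetical convenience wrapper: decoded channels in, 16kHz mono PCM out. */
export function toWhisperPcm(channels: Float32Array[], sampleRate: number): Float32Array {
  return resampleLinear(downmixToMono(channels), sampleRate, 16000);
}
```

For speech-quality input a proper resampler with filtering would likely be preferable; this only illustrates the shape of the conversion that option (a) implies.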
Alternatives Considered
| Instead of | Could Use | Tradeoff |
|---|---|---|
| @mintplex-labs/piper-tts-web (browser) | piper Python CLI via server route | CLI requires Python + model install; adds server complexity; VOICE-F01 deferred to future |
| Fix route registration | Rewrite transcription with smart-whisper | smart-whisper requires PCM conversion (no ffmpeg on host); high risk; the existing CLI fallback is simpler |
Installation (UI):
pnpm --filter @paperclipai/ui add @mintplex-labs/piper-tts-web
No new server deps needed for the minimal fix (just registering existing route). If upgrading to smart-whisper in a future phase:
pnpm --filter @paperclipai/server add smart-whisper node-wav
Version verification (confirmed against npm registry 2026-04-01):
- @mintplex-labs/piper-tts-web: 1.0.4 (latest)
- smart-whisper: 0.8.1 (latest)
- node-wav: 0.0.2 (latest)
Architecture Patterns
Recommended Project Structure
server/src/
├── app.ts                          # ADD: chatFileRoutes registration (1-line fix)
└── routes/
    ├── chat-files.ts               # Existing /transcribe route; no changes needed
    └── voice.ts                    # Optional: extract a dedicated voice route if /synthesize added
ui/src/
├── components/
│   ├── VoiceRecordButton.tsx       # Existing; no changes needed once server route is fixed
│   ├── TtsButton.tsx               # NEW: speaker icon button that calls piper-tts-web predict()
│   ├── NexusOnboardingWizard.tsx   # MODIFY: insert step 4 (voice), shift steps 4→5, 5→6
│   └── onboarding/
│       └── VoiceStep.tsx           # NEW: opt-in step for voice features
└── hooks/
    └── usePiperTts.ts              # NEW: singleton TtsSession, download(), predict(), status
Pattern 1: Route Registration Fix (STT)
What: chatFileRoutes is defined and exported but never registered in app.ts. Add one import and one api.use() call.
When to use: This is the only required change for STT to function.
Example:
// server/src/app.ts — add after line ~31 (other imports)
import { chatFileRoutes } from "./routes/chat-files.js";
// ...inside createApp, after api.use(assistantHandoffRoutes(db)):
api.use(chatFileRoutes(db, opts.storageService));
The chatFileRoutes function signature: chatFileRoutes(db: Db, storage: StorageService).
In app.ts, opts.storageService is the storage argument.
Pattern 2: Piper TTS Hook (Browser-Side WASM)
What: A React hook wrapping @mintplex-labs/piper-tts-web that manages model download state and synthesis. The model download is the pre-warm step that prevents the silent 15-30s hang on first synthesis.
When to use: Any component that needs to read assistant responses aloud.
Example:
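// ui/src/hooks/usePiperTts.ts
import { useState } from "react";
import { tts } from "@mintplex-labs/piper-tts-web";

const DEFAULT_VOICE = "en_US-hfc_female-medium";

export function usePiperTts() {
  const [status, setStatus] = useState<"idle" | "downloading" | "ready" | "speaking">("idle");
  const [progress, setProgress] = useState(0); // 0–100

  // Pre-warm: fetch the voice model (unless cached) while reporting progress.
  async function prewarm() {
    setStatus("downloading");
    try {
      const stored = await tts.stored();
      if (!stored.includes(DEFAULT_VOICE)) {
        await tts.download(DEFAULT_VOICE, (p) => {
          // Skip updates while the total size is unknown (avoids NaN/Infinity).
          if (p.total > 0) setProgress(Math.round((p.loaded / p.total) * 100));
        });
      }
      setStatus("ready");
    } catch (err) {
      setStatus("idle"); // leave the hook retryable after a failed download
      throw err;
    }
  }

  async function speak(text: string) {
    if (status !== "ready") return;
    setStatus("speaking");
    try {
      const wav = await tts.predict({ text, voiceId: DEFAULT_VOICE });
      // predict() is described in this document as returning a Blob URL string;
      // if the installed version returns a Blob instead, wrap it first.
      const url = typeof wav === "string" ? wav : URL.createObjectURL(wav);
      const audio = new Audio(url);
      audio.onended = () => setStatus("ready");
      await audio.play(); // rejects if the browser blocks playback
    } catch (err) {
      setStatus("ready");
      throw err;
    }
  }

  return { status, progress, prewarm, speak };
}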
Key points:
- tts.predict() returns a Blob URL (WAV format). Use new Audio(blobUrl).play(): it is the simplest approach, and no Web Audio API is needed.
- tts.stored() checks the IndexedDB cache; the download is skipped if the model is already present.
- The library is browser-only. Do not import it in server code.
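The status transitions behind such a hook (idle, downloading, ready, speaking) can also be written as a pure reducer, which makes them testable without loading the WASM library. This is a minimal sketch, not the library's API: the event names, TtsState shape, and toPercent helper are all hypothetical.

```typescript
export type TtsStatus = "idle" | "downloading" | "ready" | "speaking";

export type TtsEvent =
  | { type: "DOWNLOAD_STARTED" }
  | { type: "PROGRESS"; loaded: number; total: number }
  | { type: "DOWNLOAD_DONE" }
  | { type: "DOWNLOAD_FAILED" }
  | { type: "SPEAK_STARTED" }
  | { type: "SPEAK_ENDED" };

export interface TtsState {
  status: TtsStatus;
  progress: number; // 0 to 100
}

export const initialTtsState: TtsState = { status: "idle", progress: 0 };

/** Convert loaded/total bytes to a clamped integer percentage. */
export function toPercent(loaded: number, total: number): number {
  if (!(total > 0)) return 0; // unknown or zero total
  return Math.min(100, Math.max(0, Math.round((loaded / total) * 100)));
}

/** Events that do not apply to the current status are ignored. */
export function ttsReducer(state: TtsState, event: TtsEvent): TtsState {
  switch (event.type) {
    case "DOWNLOAD_STARTED":
      return state.status === "idle" ? { status: "downloading", progress: 0 } : state;
    case "PROGRESS":
      return state.status === "downloading"
        ? { ...state, progress: toPercent(event.loaded, event.total) }
        : state;
    case "DOWNLOAD_DONE":
      return state.status === "downloading" ? { status: "ready", progress: 100 } : state;
    case "DOWNLOAD_FAILED":
      return state.status === "downloading" ? initialTtsState : state;
    case "SPEAK_STARTED":
      return state.status === "ready" ? { ...state, status: "speaking" } : state;
    case "SPEAK_ENDED":
      return state.status === "speaking" ? { ...state, status: "ready" } : state;
    default:
      return state;
  }
}
```

Because invalid transitions are no-ops, a "speak" request that arrives before the model is ready cannot move the hook into the speaking state.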
Pattern 3: Onboarding Voice Step
What: Add a step 4 in NexusOnboardingWizard.tsx that shows STT+TTS capability, checks mic permission, and lets users opt in. Because piper-tts-web is CPU-safe WASM, the gate is browser capability (navigator.mediaDevices), not hardware tier.
When to use: VOICE-03 requirement — offer voice during onboarding.
Step numbering shift:
- Current: 1=hardware, 2=mode, 3=provider, 4=rootDir, 5=summary
- New: 1=hardware, 2=mode, 3=provider, 4=voice, 5=rootDir, 6=summary
- Update the Step type from 1 | 2 | 3 | 4 | 5 to 1 | 2 | 3 | 4 | 5 | 6
- Update the "Step X of 5" label to "Step X of 6" (or keep it at 5 and treat the summary as a bonus step)
- Update all setStep() calls to use the new numbers
Voice opt-in state to track:
const [voiceEnabled, setVoiceEnabled] = useState(false);
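Persist the choice in nexus-settings.json via a new field (e.g., voiceEnabled: boolean) if it should survive across sessions. The no-DB-schema-changes constraint applies to this phase, but nexus-settings.json is file-backed JSON rather than a DB table, so a new field there does not violate it. localStorage is the fallback if the settings file must stay untouched.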
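The backward-compatibility concern with a new settings field is that files written before this phase will not contain it. The sketch below shows the intended behaviour in plain TypeScript as a stand-in for the project's Zod schema: a missing voiceEnabled defaults to false, and a present but non-boolean value is rejected. The parseNexusSettings name and the assumption that mode is a string are hypothetical; only { mode } is known from the existing file.

```typescript
export interface NexusSettings {
  mode: string; // assumed to be a string; the real schema may be narrower
  voiceEnabled: boolean;
}

/** Hand-written stand-in for a schema parse; the real project uses Zod. */
export function parseNexusSettings(raw: unknown): NexusSettings {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("settings must be a JSON object");
  }
  const record = raw as Record<string, unknown>;
  if (typeof record.mode !== "string") {
    throw new Error("settings.mode must be a string");
  }
  if (record.voiceEnabled !== undefined && typeof record.voiceEnabled !== "boolean") {
    throw new Error("settings.voiceEnabled must be a boolean when present");
  }
  // Older files have no voiceEnabled field: treat that as "voice off".
  return { mode: record.mode, voiceEnabled: record.voiceEnabled ?? false };
}
```

With this behaviour, existing settings files keep loading unchanged and voice stays off until the user opts in.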
Example VoiceStep component structure:
// ui/src/components/onboarding/VoiceStep.tsx
import { useEffect, useState } from "react";
import { Button } from "@/components/ui/button"; // assumed path; use the project's Button

interface VoiceStepProps {
  onEnable: () => void;
  onSkip: () => void;
}

export function VoiceStep({ onEnable, onSkip }: VoiceStepProps) {
  // null = probe still running, true/false = result
  const [micAvailable, setMicAvailable] = useState<boolean | null>(null);

  useEffect(() => {
    // Non-blocking probe: is an audio input device present?
    if (!navigator.mediaDevices?.enumerateDevices) {
      setMicAvailable(false); // API missing (e.g. insecure context)
      return;
    }
    navigator.mediaDevices.enumerateDevices()
      .then(devices => setMicAvailable(devices.some(d => d.kind === "audioinput")))
      .catch(() => setMicAvailable(false));
  }, []);

  return (
    <>
      <h1>Voice features</h1>
      <p>Speak to your assistant (Whisper STT) and hear responses read aloud (Piper TTS). Runs entirely on your device.</p>
      {micAvailable === false && (
        <p className="text-muted-foreground text-sm">No microphone detected. STT is unavailable, but TTS still works.</p>
      )}
      <Button onClick={onEnable}>Enable voice</Button>
      <Button variant="ghost" onClick={onSkip}>Skip</Button>
    </>
  );
}
Anti-Patterns to Avoid
- Importing piper-tts-web in Node.js: The library explicitly does not support Node.js. It must only be imported in browser code (UI package). Vite will not include it in the server bundle.
- Calling tts.predict() before downloading the model: Results in a 15-30s silent hang. Always call tts.download() first (or check tts.stored()), show progress, then call predict().
- Registering /transcribe before auth middleware: The existing /transcribe route calls assertBoard(req), so it must sit inside the api sub-router (after boardMutationGuard), not before it. The chatFileRoutes call belongs at line ~161 of app.ts alongside other api.use() calls.
- Using new Audio() with a raw Buffer: tts.predict() returns a Blob URL string. Pass it directly to new Audio(url), not new Audio(Buffer).
Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| TTS synthesis in browser | Custom ONNX loader + Piper WASM integration | @mintplex-labs/piper-tts-web | Already bundles ort-wasm, phonemizer, model management; the 492KB package handles all of it |
| Model download progress | Manual fetch with XHR progress | tts.download(voiceId, progressCb) | Built-in progress callback, automatic IndexedDB caching |
| PCM audio decoding for Whisper | Custom webm→PCM decoder | Keep CLI fallback in existing /transcribe route | No ffmpeg on host; smart-whisper requires PCM, so adding conversion is out-of-scope for this phase |
| Mic permission detection | Custom navigator probe | navigator.mediaDevices.enumerateDevices() | Native browser API, no library needed |
Key insight: The browser handles TTS completely — no server-side Piper install needed for VOICE-01/02. Server-side Piper (VOICE-F01) is explicitly deferred.
Common Pitfalls
Pitfall 1: /transcribe Returns 404
What goes wrong: VoiceRecordButton sends audio to /api/transcribe, gets a 404, swallows the error silently. Voice input appears broken with no feedback to the user.
Why it happens: chatFileRoutes is exported in routes/index.ts but not imported or registered in app.ts. The route exists in code but is never mounted.
How to avoid: Add import { chatFileRoutes } from "./routes/chat-files.js" and api.use(chatFileRoutes(db, opts.storageService)) in app.ts.
Warning signs: POST /api/transcribe returns 404; no logs from the route handler; VoiceRecordButton spinner appears and disappears with no text inserted.
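Since the silent failure is what hides this bug, one way to make it visible is to have the client turn non-2xx responses into a user-facing message instead of swallowing them. The sketch below is illustrative only: the transcribe helper, the TranscribeResult type, and the assumed success payload { text } are hypothetical, while the { error } payload on 503 follows the existing route's behaviour. The fetch function is injected so the logic can be exercised without a browser.

```typescript
export type TranscribeResult =
  | { ok: true; text: string }
  | { ok: false; status: number; message: string };

// Minimal structural type for the injected fetch-like function.
type FetchLike = (
  url: string,
  init: { method: string; body: unknown },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

export async function transcribe(
  body: unknown, // e.g. the recorded audio; exact upload format is route-specific
  fetchImpl: FetchLike,
): Promise<TranscribeResult> {
  let res: Awaited<ReturnType<FetchLike>>;
  try {
    res = await fetchImpl("/api/transcribe", { method: "POST", body });
  } catch {
    return { ok: false, status: 0, message: "Could not reach the server." };
  }

  // A 404 from an unmounted route may not carry JSON at all.
  const payload = (await res.json().catch(() => null)) as
    | { text?: unknown; error?: unknown }
    | null;

  if (res.ok && typeof payload?.text === "string") {
    return { ok: true, text: payload.text };
  }
  if (res.status === 404) {
    return {
      ok: false,
      status: 404,
      message: "Voice input is not available on this server (route not mounted).",
    };
  }
  const message =
    typeof payload?.error === "string"
      ? payload.error
      : `Transcription failed (HTTP ${res.status}).`;
  return { ok: false, status: res.status, message };
}
```

In the browser, the injected function would be a thin wrapper around fetch; the caller can then show message next to the record button rather than dropping it.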
Pitfall 2: Piper TTS Silent Hang on First Use
What goes wrong: User clicks "speak" button. Nothing happens for 15-30 seconds, then audio plays. User thinks it's broken.
Why it happens: tts.predict() internally downloads the ONNX model (~10-50MB) on first call with no progress feedback.
How to avoid: Call tts.download(voiceId, progressCb) explicitly before the first predict(). Show a progress bar or spinner with percentage. The prewarm() pattern in the hook above is the canonical fix for VOICE-02.
Warning signs: First TTS invocation is slow (15-30s), subsequent calls are fast. The model appears in browser DevTools under IndexedDB after the first successful call.
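A related detail is that several components may trigger the pre-warm at once (for example the onboarding step and a speaker button), which should not start the download twice. The sketch below deduplicates concurrent calls by sharing one in-flight promise. It is written against an injected VoiceStore interface that mirrors the stored() and download(voiceId, progressCb) calls described in this document; the interface and the createPrewarmer name are hypothetical, not part of the library.

```typescript
export interface DownloadProgress {
  loaded: number;
  total: number;
}

// Mirrors the two library calls this document relies on; injected for testability.
export interface VoiceStore {
  stored(): Promise<string[]>;
  download(voiceId: string, onProgress: (p: DownloadProgress) => void): Promise<unknown>;
}

export function createPrewarmer(store: VoiceStore, voiceId: string) {
  let inFlight: Promise<void> | null = null;

  /** Resolves once the voice model is cached; concurrent calls share one download. */
  return function prewarm(onPercent: (pct: number) => void): Promise<void> {
    if (inFlight) return inFlight;

    inFlight = (async () => {
      const cached = await store.stored();
      if (!cached.includes(voiceId)) {
        await store.download(voiceId, (p) => {
          // Ignore events with an unknown total to avoid NaN percentages.
          if (p.total > 0) onPercent(Math.round((p.loaded / p.total) * 100));
        });
      }
      onPercent(100);
    })().catch((err) => {
      inFlight = null; // allow a retry after a failed download
      throw err;
    });

    return inFlight;
  };
}
```

Only the first caller's onPercent receives progress in this sketch; a shared state store would be needed if several progress bars must update at once.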
Pitfall 3: chatFileRoutes Argument Mismatch
What goes wrong: Passing incorrect arguments to chatFileRoutes(db, storage) — e.g., passing the wrong storage interface type.
Why it happens: app.ts uses opts.storageService which is a StorageService. The function signature is chatFileRoutes(db: Db, storage: StorageService).
How to avoid: Verify the StorageService import path and type. In app.ts, opts.storageService is already typed as StorageService and is used by other routes (e.g., assetRoutes(db, opts.storageService)). Mirror that pattern exactly.
Pitfall 4: Onboarding Step Counter Mismatch
What goes wrong: Adding step 4 (voice) but forgetting to update Back/Continue setStep() calls in steps 5 and 6, causing step transitions to skip or loop.
Why it happens: NexusOnboardingWizard.tsx has hard-coded step numbers throughout (setStep(4), setStep(5), etc.) and a Step type union (1 | 2 | 3 | 4 | 5).
How to avoid: When inserting step 4, do a full audit of all setStep(N) calls and the Step type. The type Step = 1 | 2 | 3 | 4 | 5 must become 1 | 2 | 3 | 4 | 5 | 6. All old setStep(4) → setStep(5), setStep(5) → setStep(6).
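An alternative to renumbering by hand, should the wizard be refactored, is to derive navigation from a single ordered list of step names so that inserting a step never requires touching other transitions. This is a sketch of that idea, not the wizard's current design: STEP_ORDER, nextStep, prevStep, and stepLabel are hypothetical names, and the step ids are taken from the ordering listed in this document.

```typescript
// Single source of truth for step order; inserting "voice" is a one-line change.
export const STEP_ORDER = [
  "hardware",
  "mode",
  "provider",
  "voice",
  "rootDir",
  "summary",
] as const;

export type StepId = (typeof STEP_ORDER)[number];

/** Next step, or the same step when already on the last one. */
export function nextStep(current: StepId): StepId {
  const i = STEP_ORDER.indexOf(current);
  return STEP_ORDER[Math.min(i + 1, STEP_ORDER.length - 1)];
}

/** Previous step, or the same step when already on the first one. */
export function prevStep(current: StepId): StepId {
  const i = STEP_ORDER.indexOf(current);
  return STEP_ORDER[Math.max(i - 1, 0)];
}

/** Human-readable position, derived from the list rather than hard-coded. */
export function stepLabel(current: StepId): string {
  return `Step ${STEP_ORDER.indexOf(current) + 1} of ${STEP_ORDER.length}`;
}
```

With this shape, Back and Continue handlers call prevStep and nextStep instead of setStep(N), so the skip-or-loop failure described above cannot be introduced by an insertion.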
Pitfall 5: piper-tts-web In a Web Worker Context
What goes wrong: Importing @mintplex-labs/piper-tts-web fails in a Web Worker because of missing window or document globals.
Why it happens: The library expects browser globals. It also mentions supporting Web Worker patterns but requires careful WASM path configuration.
How to avoid: Use the library from a regular React component/hook (main thread). Do not import in server-side code, Node.js workers, or Vitest Node environment tests. Mark test files importing it with @vitest-environment jsdom if needed, or mock the module in tests.
Code Examples
Fix: Register chatFileRoutes in app.ts
// Source: server/src/app.ts (existing pattern — mirror assetRoutes)
// Near top of file with other route imports:
import { chatFileRoutes } from "./routes/chat-files.js";
// Inside createApp(), after api.use(assistantHandoffRoutes(db)):
api.use(chatFileRoutes(db, opts.storageService));
Confirmed pattern from app.ts line 147:
api.use(assetRoutes(db, opts.storageService));
TTS: Minimal predict() call
// Source: @mintplex-labs/piper-tts-web README
import { tts } from "@mintplex-labs/piper-tts-web";
// Download model with progress (pre-warm):
await tts.download("en_US-hfc_female-medium", (progress) => {
  const pct = Math.round((progress.loaded / progress.total) * 100);
  console.log(`Downloading voice model: ${pct}%`);
});
// Synthesize:
const wav = await tts.predict({
  text: "Hello, I am your assistant.",
  voiceId: "en_US-hfc_female-medium",
});
// Play:
const audio = new Audio(wav);
audio.play();
TTS: Check if already downloaded (skip re-download)
// Source: @mintplex-labs/piper-tts-web README
const stored = await tts.stored(); // string[] of cached voiceIds
if (!stored.includes("en_US-hfc_female-medium")) {
  await tts.download("en_US-hfc_female-medium", progressCb);
}
Mic availability probe (no library required)
// Source: MDN Web API (browser standard)
async function hasMicrophone(): Promise<boolean> {
  try {
    const devices = await navigator.mediaDevices.enumerateDevices();
    return devices.some((d) => d.kind === "audioinput");
  } catch {
    return false;
  }
}
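The device probe above answers whether a microphone is attached; the onboarding step also needs to know whether the browser exposes the recording and WASM APIs at all. A minimal sketch of that synchronous check follows, with the browser globals passed in so it can run outside a browser. The BrowserEnv and VoiceCapabilities types and the detectVoiceCapabilities name are hypothetical.

```typescript
export interface VoiceCapabilities {
  canRecord: boolean; // MediaRecorder + getUserMedia present (needed for STT input)
  canSynthesize: boolean; // WebAssembly present (needed for browser-side TTS)
}

// Only the globals this check looks at; everything is optional on purpose.
export interface BrowserEnv {
  MediaRecorder?: unknown;
  WebAssembly?: unknown;
  navigator?: { mediaDevices?: { getUserMedia?: unknown } };
}

export function detectVoiceCapabilities(env: BrowserEnv): VoiceCapabilities {
  const canRecord =
    typeof env.MediaRecorder === "function" &&
    typeof env.navigator?.mediaDevices?.getUserMedia === "function";
  const canSynthesize =
    typeof env.WebAssembly === "object" && env.WebAssembly !== null;
  return { canRecord, canSynthesize };
}
```

In the wizard this would be called once with the browser's global object, and the result can decide which of the STT and TTS options are offered before any permission prompt is shown.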
State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| whisper CLI (openai-whisper Python) | smart-whisper Node.js native binding | 2023-2024 | No Python runtime needed; better perf |
| Piper CLI binary | @mintplex-labs/piper-tts-web WASM | 2024 | Runs in browser, no server setup |
| Server-rendered TTS audio | Client-side WASM synthesis | 2024 | Eliminates network round-trip; offline-safe |
Deprecated/outdated:
- whisper-cpp CLI: still works but requires system-level install; the existing /transcribe route already has this fallback, which is adequate for now
- rhasspy/piper repository: archived Oct 2025, development moved to OHF-Voice/piper1-gpl; the @mintplex-labs/piper-tts-web npm package uses the original archived models (MIT) and still works
Open Questions
- nexus-settings voice persistence
  - What we know: nexus-settings.json currently only stores { mode }. The nexusSettingsSchema is a Zod schema.
  - What's unclear: Should voiceEnabled: boolean be added to the schema? The constraint says "no DB schema changes" but this is a file-backed JSON, not a DB table.
  - Recommendation: Add voiceEnabled: z.boolean().default(false) to nexusSettingsSchema. This is a file field, not a DB migration. The planner should confirm this is acceptable under the "no DB schema changes" constraint.
- smart-whisper Apple Silicon unverified claim (from STATE.md blockers)
  - What we know: STATE.md notes "smart-whisper Apple Silicon acceleration claim unverified on Mac Mini M4 — fall back to tiny.en if base.en acceleration not confirmed on device."
  - What's unclear: Whether Metal acceleration actually works for base.en on M4.
  - Recommendation: The current /transcribe route uses CLI fallback anyway. Since this phase is NOT rewriting STT with smart-whisper (just fixing route registration), this blocker does not apply to Phase 34.
- VoiceRecordButton in PersonalAssistant
  - What we know: ChatPanel sets enableVoiceInput={true}. PersonalAssistant.tsx does not use ChatInput and has its own send form that does NOT include a VoiceRecordButton. Voice input only works in the project-mode ChatPanel, not in the personal assistant chat.
  - What's unclear: Whether VOICE-01/02/03 require voice in personal assistant chat specifically.
  - Recommendation: Planner should add VoiceRecordButton to PersonalAssistant.tsx's input area as part of this phase, since personal assistant is the primary chat surface for v1.5.
Environment Availability
| Dependency | Required By | Available | Version | Fallback |
|---|---|---|---|---|
| Node.js | Server runtime | Yes | v20.20.2 | — |
| piper CLI | VOICE-01 (server-side) | No | — | Browser WASM via piper-tts-web (preferred) |
| whisper CLI | /transcribe route | No | — | Route returns 503 with user-visible error |
| whisper-cpp CLI | /transcribe route | No | — | Falls through to openai-whisper, then 503 |
| ffmpeg | WebM→PCM conversion | No | — | Keep CLI-fallback STT; no smart-whisper upgrade this phase |
| Browser MediaRecorder | VoiceRecordButton | N/A (browser) | — | Degrades gracefully (mic unavailable state) |
Missing dependencies with no fallback:
- None that block this phase: the /transcribe route already handles missing Whisper CLIs gracefully with a 503 + descriptive error. Piper TTS runs entirely in browser WASM, with no server dep.
Missing dependencies with fallback:
- whisper / whisper-cpp: Not installed. Route returns { error: "Whisper not available. Install whisper-cpp or openai-whisper for voice input." } with 503. This is existing behavior. STT stays unavailable until the user installs Whisper, which is acceptable because the 503 message tells them what to install.
Validation Architecture
Test Framework
| Property | Value |
|---|---|
| Framework | Vitest 3.0.5 |
| Config file | server/vitest.config.ts |
| Quick run command | npx vitest run server/src/__tests__/34-voice-routes.test.ts |
| Full suite command | npx vitest run (from /opt/nexus) |
Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|---|---|---|---|---|
| VOICE-01 | chatFileRoutes is mounted in app.ts, so GET/POST routes are reachable | unit (route) | npx vitest run server/src/__tests__/34-voice-routes.test.ts | No (Wave 0) |
| VOICE-02 | usePiperTts hook exposes prewarm(), status, progress | unit (hook) | npx vitest run ui/src/hooks/usePiperTts.test.ts | No (Wave 0) |
| VOICE-03 | NexusOnboardingWizard renders voice step at step 4 | unit (component) | Manual / npx vitest run ui/src/components/NexusOnboardingWizard.test.ts | No (Wave 0) |
Sampling Rate
- Per task commit: npx vitest run server/src/__tests__/34-voice-routes.test.ts
- Per wave merge: npx vitest run
- Phase gate: Full suite green before /gsd:verify-work
Wave 0 Gaps
- server/src/__tests__/34-voice-routes.test.ts: covers VOICE-01 (route registration, 503 when no whisper CLI)
- ui/src/hooks/usePiperTts.test.ts: covers VOICE-02 hook state machine (mock piper-tts-web)
- ui/src/components/onboarding/VoiceStep.test.tsx: covers VOICE-03 step rendering
Sources
Primary (HIGH confidence)
- Codebase inspection: server/src/routes/chat-files.ts, server/src/app.ts, ui/src/components/VoiceRecordButton.tsx, ui/src/components/NexusOnboardingWizard.tsx
- npm registry: @mintplex-labs/piper-tts-web@1.0.4, smart-whisper@0.8.1 (verified 2026-04-01)
Secondary (MEDIUM confidence)
- Mintplex-Labs/piper-tts-web README: tts.download(), tts.predict(), tts.stored() API
- JacobLinCool/smart-whisper GitHub: Whisper class, PCM Float32Array requirement, Metal on Apple Silicon
- smart-whisper documentation: transcribe API, model manager
Tertiary (LOW confidence)
- WebSearch results for Piper TTS Node.js integration — browser-only WASM pattern confirmed by multiple sources
Metadata
Confidence breakdown:
- Standard stack: MEDIUM — npm package versions verified; API verified via README; no local test environment to run the library
- Architecture: HIGH — based on direct codebase inspection (route registration gap confirmed, wizard step structure confirmed)
- Pitfalls: HIGH — route registration gap is a confirmed code-level fact, not speculation
Research date: 2026-04-01
Valid until: 2026-05-01 (stable libraries; piper-tts-web and smart-whisper are low-churn)