nexus/.planning/phases/34-voice/34-RESEARCH.md
2026-04-03 22:23:50 +00:00

# Phase 34: Voice - Research
**Researched:** 2026-04-01
**Domain:** Server-side STT (Whisper CLI behind the existing `/api/transcribe` route; smart-whisper evaluated), Browser TTS (Piper via @mintplex-labs/piper-tts-web WASM), Onboarding voice step
**Confidence:** MEDIUM
---
<user_constraints>
## User Constraints (from CONTEXT.md)
### Locked Decisions
None — all implementation choices are at Claude's discretion.
### Claude's Discretion
All implementation choices are at Claude's discretion.
### Deferred Ideas (OUT OF SCOPE)
None.
</user_constraints>
---
<phase_requirements>
## Phase Requirements
| ID | Description | Research Support |
|----|-------------|------------------|
| VOICE-01 | User gets Piper TTS speech output that works on CPU-only hardware | @mintplex-labs/piper-tts-web runs entirely in browser WASM via ONNX Runtime — no GPU needed |
| VOICE-02 | Piper TTS pre-warms on first use with visible download progress (no silent 15-30s hang) | `tts.download(voiceId, progressCallback)` API provides loaded/total bytes; render a progress bar before calling `predict()` |
| VOICE-03 | Voice features (Whisper STT + Piper TTS) offered during onboarding based on hardware capability | NexusOnboardingWizard currently has 5 steps; insert voice as a new step 4, shown once hardware detection has run (`hardwareInfo.hardwareTier !== undefined`). All tiers can run voice since TTS is purely CPU-bound WASM, so the effective gate is browser capability, not tier |
</phase_requirements>
---
## Summary
Phase 34 adds two voice capabilities: speech-to-text (STT) via Whisper, and text-to-speech (TTS) via Piper, plus an onboarding step where users can opt into voice features.
The STT side already has a server route (`POST /api/transcribe` in `chat-files.ts`) and a `VoiceRecordButton` component that calls it. The route is implemented correctly but has a critical gap: it is **exported from `routes/index.ts` but never registered in `app.ts`**, so `POST /api/transcribe` returns 404 at runtime. Fixing this registration is the primary STT task.
For TTS, the project currently has zero Piper integration. The recommended approach is browser-side WASM via `@mintplex-labs/piper-tts-web` (v1.0.4, MIT). This library wraps the Piper ONNX models in WebAssembly so synthesis runs on-device without a server round-trip, satisfying VOICE-01 (CPU-only hardware). The key UX concern (VOICE-02) is a 10-50 MB model download that blocks first synthesis — the library provides a `download()` method with a progress callback that must be wired to a visible UI element before calling `predict()`.
The onboarding voice step (VOICE-03) should be inserted into `NexusOnboardingWizard.tsx` as step 4 (shifting the existing "root directory" step to 5 and "summary" to 6). The step should probe mic permission availability and detect whether the browser supports `MediaRecorder` to inform the user, then offer a "yes, enable voice" / "skip" choice. Since all hardware tiers can run browser-WASM TTS, the gate is not tier-based — it is browser-capability-based.
**Primary recommendation:** Register `chatFileRoutes` in `app.ts` to fix STT; add `@mintplex-labs/piper-tts-web` for browser-side TTS with a progress-bar pre-warm flow; add a voice opt-in step in `NexusOnboardingWizard`.
---
## Standard Stack
### Core
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| `@mintplex-labs/piper-tts-web` | 1.0.4 | Browser-side Piper TTS via WASM/ONNX | Browser-only, no server infra, CPU-safe, actively maintained fork used in AnythingLLM |
| `smart-whisper` | 0.8.1 | Native Node.js Whisper.cpp binding for STT | Auto-downloads models, Metal on Apple Silicon, used as drop-in replacement for CLI approach |
### Supporting
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| `node-wav` | 0.0.2 | Decode WAV buffer to Float32Array for smart-whisper | Required: smart-whisper only accepts 16kHz Float32Array PCM, not raw webm |
**Note on audio conversion:** The browser sends `audio/webm;codecs=opus`, while smart-whisper requires 16 kHz mono Float32Array PCM. The existing `/transcribe` route writes a temp `.webm` file and calls the `whisper` or `whisper-cpp` CLI, which works when one of those CLIs is installed. Upgrading to `smart-whisper` would add a conversion step, and `ffmpeg` is not installed on this host. That leaves two options: (a) require `ffmpeg` as an install-time dependency via `fluent-ffmpeg` plus a system `ffmpeg`, or (b) keep the CLI-fallback pattern in the existing route and just **fix the route registration** rather than rewriting the transcription logic. Option (b) is lower risk.
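To make the format requirement concrete, here is a minimal sketch of the last leg of such a conversion, assuming the audio has already been decoded to per-channel Float32 samples (for example by a WAV decoder). `toMono16k` is a hypothetical helper, not part of smart-whisper or node-wav, and the plain linear interpolation it uses applies no low-pass filter, so treat it as an illustration rather than production resampling.

```typescript
// Hypothetical helper (not a smart-whisper or node-wav API): downmix decoded
// PCM channels to mono, then resample to 16 kHz with linear interpolation.
const TARGET_RATE = 16_000;

function toMono16k(channels: Float32Array[], sampleRate: number): Float32Array {
  if (channels.length === 0 || sampleRate <= 0) return new Float32Array(0);
  const frames = channels[0].length; // assumes all channels have equal length

  // Downmix: average all channels frame by frame.
  const mono = new Float32Array(frames);
  for (const ch of channels) {
    for (let i = 0; i < frames; i++) mono[i] += ch[i] / channels.length;
  }
  if (sampleRate === TARGET_RATE) return mono;

  // Resample: interpolate between the two nearest source frames.
  const outLength = Math.floor((frames * TARGET_RATE) / sampleRate);
  const out = new Float32Array(outLength);
  const step = sampleRate / TARGET_RATE;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, frames - 1);
    const frac = pos - left;
    out[i] = mono[left] * (1 - frac) + mono[right] * frac;
  }
  return out;
}
```

This step only covers PCM that is already decoded; getting from `audio/webm;codecs=opus` to PCM still needs a decoder such as `ffmpeg`, which is why option (b) remains the lower-risk path for this phase.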
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| `@mintplex-labs/piper-tts-web` (browser) | `piper` Python CLI via server route | CLI requires Python + model install; adds server complexity; VOICE-F01 deferred to future |
| Fix route registration | Rewrite transcription with `smart-whisper` | smart-whisper requires PCM conversion (no ffmpeg on host); high risk; the existing CLI fallback is simpler |
**Installation (UI):**
```bash
pnpm --filter @paperclipai/ui add @mintplex-labs/piper-tts-web
```
**No new server deps needed** for the minimal fix (just registering existing route). If upgrading to smart-whisper in a future phase:
```bash
pnpm --filter @paperclipai/server add smart-whisper node-wav
```
**Version verification (confirmed against npm registry 2026-04-01):**
- `@mintplex-labs/piper-tts-web`: 1.0.4 (latest)
- `smart-whisper`: 0.8.1 (latest)
- `node-wav`: 0.0.2 (latest)
---
## Architecture Patterns
### Recommended Project Structure
```
server/src/
├── app.ts                          # ADD: chatFileRoutes registration (1-line fix)
└── routes/
    ├── chat-files.ts               # Existing /transcribe route; no changes needed
    └── voice.ts                    # Optional: extract a dedicated voice route if /synthesize added
ui/src/
├── components/
│   ├── VoiceRecordButton.tsx       # Existing; no changes needed once server route is fixed
│   ├── TtsButton.tsx               # NEW: speaker icon button that calls piper-tts-web predict()
│   ├── NexusOnboardingWizard.tsx   # MODIFY: insert step 4 (voice), shift steps 4→5, 5→6
│   └── onboarding/
│       └── VoiceStep.tsx           # NEW: opt-in step for voice features
└── hooks/
    └── usePiperTts.ts              # NEW: singleton TtsSession, download(), predict(), status
```
### Pattern 1: Route Registration Fix (STT)
**What:** `chatFileRoutes` is defined and exported but never registered in `app.ts`. Add one import and one `api.use()` call.
**When to use:** This is the only required change for STT to function.
**Example:**
```typescript
// server/src/app.ts — add after line ~31 (other imports)
import { chatFileRoutes } from "./routes/chat-files.js";
// ...inside createApp, after api.use(assistantHandoffRoutes(db)):
api.use(chatFileRoutes(db, opts.storageService));
```
The `chatFileRoutes` function signature: `chatFileRoutes(db: Db, storage: StorageService)`.
In `app.ts`, `opts.storageService` is the storage argument.
### Pattern 2: Piper TTS Hook (Browser-Side WASM)
**What:** A React hook wrapping `@mintplex-labs/piper-tts-web` that manages model download state and synthesis. The model download is the pre-warm step that prevents the silent 15-30s hang on first synthesis.
**When to use:** Any component that needs to read assistant responses aloud.
**Example:**
```typescript
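// ui/src/hooks/usePiperTts.ts
import { useState } from "react";
// The package README imports the whole namespace (download, predict, stored, ...).
import * as tts from "@mintplex-labs/piper-tts-web";

const DEFAULT_VOICE = "en_US-hfc_female-medium";

type TtsStatus = "idle" | "downloading" | "ready" | "speaking";

export function usePiperTts() {
  const [status, setStatus] = useState<TtsStatus>("idle");
  const [progress, setProgress] = useState(0); // 0-100

  // Pre-warm: download the voice model with visible progress unless it is cached.
  async function prewarm() {
    setStatus("downloading");
    try {
      const stored = await tts.stored();
      if (!stored.includes(DEFAULT_VOICE)) {
        await tts.download(DEFAULT_VOICE, (p) => {
          // Skip updates with an unknown total so the UI never shows NaN.
          if (p.total > 0) setProgress(Math.round((p.loaded / p.total) * 100));
        });
      }
      setProgress(100);
      setStatus("ready");
    } catch {
      setStatus("idle"); // download failed; the caller may retry prewarm()
    }
  }

  async function speak(text: string) {
    if (status !== "ready") return;
    setStatus("speaking");
    try {
      // Per the package README, predict() resolves to a WAV Blob.
      const wav = await tts.predict({ text, voiceId: DEFAULT_VOICE });
      const url = URL.createObjectURL(wav);
      const audio = new Audio(url);
      audio.onended = () => {
        URL.revokeObjectURL(url); // release the Blob once playback ends
        setStatus("ready");
      };
      await audio.play();
    } catch {
      setStatus("ready"); // synthesis or playback failed; allow another attempt
    }
  }

  return { status, progress, prewarm, speak };
}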
```
**Key points:**
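- `tts.predict()` resolves to a WAV `Blob` (per the package README), not a URL string. Wrap it with `URL.createObjectURL(blob)` and play it with `new Audio(url).play()`; this is the simplest approach and needs no Web Audio API.
- `tts.stored()` lists the voice models already cached in the browser (the README describes the Origin Private File System as the store); the download is skipped if the model is already present.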
- The library is browser-only. Do not import in server code.
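The status transitions the hook relies on can also be written as a pure reducer, which makes the state machine testable without mocking the WASM library. The sketch below is one possible shape, not the project's implementation; the event names and `ttsReducer` are hypothetical.

```typescript
// Hypothetical reducer for the TTS status machine; event names are illustrative.
type TtsStatus = "idle" | "downloading" | "ready" | "speaking";
type TtsState = { status: TtsStatus; progress: number };
type TtsEvent =
  | { type: "DOWNLOAD_START" }
  | { type: "DOWNLOAD_PROGRESS"; loaded: number; total: number }
  | { type: "DOWNLOAD_DONE" }
  | { type: "DOWNLOAD_FAILED" }
  | { type: "SPEAK_START" }
  | { type: "SPEAK_END" };

const initialTtsState: TtsState = { status: "idle", progress: 0 };

function ttsReducer(state: TtsState, event: TtsEvent): TtsState {
  switch (event.type) {
    case "DOWNLOAD_START":
      return state.status === "idle" ? { status: "downloading", progress: 0 } : state;
    case "DOWNLOAD_PROGRESS": {
      // Ignore stray progress events and unknown totals.
      if (state.status !== "downloading" || event.total <= 0) return state;
      const pct = Math.round((event.loaded / event.total) * 100);
      return { ...state, progress: Math.min(100, Math.max(0, pct)) };
    }
    case "DOWNLOAD_DONE":
      return state.status === "downloading" ? { status: "ready", progress: 100 } : state;
    case "DOWNLOAD_FAILED":
      return state.status === "downloading" ? initialTtsState : state;
    case "SPEAK_START":
      // Speaking is only legal once the model is ready (the VOICE-02 guarantee).
      return state.status === "ready" ? { ...state, status: "speaking" } : state;
    case "SPEAK_END":
      return state.status === "speaking" ? { ...state, status: "ready" } : state;
    default:
      return state;
  }
}
```

Because invalid transitions return the state unchanged, a `SPEAK_START` issued before the download finishes is a no-op instead of a silent hang.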
### Pattern 3: Onboarding Voice Step
**What:** Add a step 4 in `NexusOnboardingWizard.tsx` that shows STT+TTS capability, checks mic permission, and lets users opt in. Because piper-tts-web is CPU-safe WASM, the gate is browser capability (`navigator.mediaDevices`), not hardware tier.
**When to use:** VOICE-03 requirement — offer voice during onboarding.
**Step numbering shift:**
- Current: 1=hardware, 2=mode, 3=provider, 4=rootDir, 5=summary
- New: 1=hardware, 2=mode, 3=provider, **4=voice**, 5=rootDir, 6=summary
- Update `Step` type from `1 | 2 | 3 | 4 | 5` to `1 | 2 | 3 | 4 | 5 | 6`
- Update the "Step X of 5" label to "Step X of 6" (or keep it at 5 if the summary is treated as an uncounted bonus step)
- Update all `setStep()` calls to use new numbers
**Voice opt-in state to track:**
```typescript
const [voiceEnabled, setVoiceEnabled] = useState(false);
```
To persist the choice across sessions, store it in `nexus-settings.json` via a new field (e.g., `voiceEnabled: boolean`). This is file-backed JSON rather than a DB table, so it stays within the "no DB schema changes" constraint. localStorage is an alternative if the settings file should not change.
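A new field has to stay compatible with settings files written before this phase. The following is a dependency-free sketch of that defaulting behavior; `NexusSettings` and `parseNexusSettings` are hypothetical names, and `mode` is typed as a plain string because the source only says the file currently holds `{ mode }`.

```typescript
// Hypothetical shape: the real schema lives in nexusSettingsSchema (Zod).
interface NexusSettings {
  mode: string;
  voiceEnabled: boolean;
}

function parseNexusSettings(raw: string): NexusSettings {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("nexus-settings.json must contain a JSON object");
  }
  const record = data as Record<string, unknown>;
  if (typeof record.mode !== "string") {
    throw new Error("nexus-settings.json is missing a string `mode`");
  }
  return {
    mode: record.mode,
    // Files written before this phase have no voiceEnabled; treat as "not opted in".
    voiceEnabled: record.voiceEnabled === true,
  };
}
```

With Zod, the same default would presumably be expressed as a defaulted boolean field on the existing schema rather than hand-written checks.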
**Example VoiceStep component structure:**
```tsx
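// ui/src/components/onboarding/VoiceStep.tsx
import { useEffect, useState } from "react";
// Assumed import path; use the same Button the wizard already imports.
import { Button } from "@/components/ui/button";

interface VoiceStepProps {
  onEnable: () => void;
  onSkip: () => void;
}

export function VoiceStep({ onEnable, onSkip }: VoiceStepProps) {
  // null = probe still running, true/false = probe result
  const [micAvailable, setMicAvailable] = useState<boolean | null>(null);

  useEffect(() => {
    // Non-blocking probe: is any audio input device present?
    if (!navigator.mediaDevices?.enumerateDevices) {
      setMicAvailable(false); // API missing (e.g., insecure context)
      return;
    }
    navigator.mediaDevices
      .enumerateDevices()
      .then((devices) => setMicAvailable(devices.some((d) => d.kind === "audioinput")))
      .catch(() => setMicAvailable(false));
  }, []);

  return (
    <>
      <h1>Voice features</h1>
      <p>Speak to your assistant (Whisper STT) and hear responses read aloud (Piper TTS). Runs entirely on your device.</p>
      {micAvailable === false && (
        <p className="text-muted-foreground text-sm">
          No microphone detected. STT is unavailable, but TTS still works.
        </p>
      )}
      <Button onClick={onEnable}>Enable voice</Button>
      <Button variant="ghost" onClick={onSkip}>Skip</Button>
    </>
  );
}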
```
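Since the gate is browser capability rather than hardware tier, the check itself can be isolated into a small pure function. This is a sketch under the assumption that recording needs both `getUserMedia` and `MediaRecorder`, and that TTS needs only a WebAssembly runtime; `VoiceEnv` and `detectVoiceCapabilities` are hypothetical names, and the environment is injected so the function can be tested outside a browser.

```typescript
// Minimal structural view of the browser globals this check reads (hypothetical type).
interface VoiceEnv {
  mediaDevices?: { getUserMedia?: unknown; enumerateDevices?: unknown };
  MediaRecorder?: unknown;
  WebAssembly?: unknown;
}

interface VoiceCapabilities {
  canRecord: boolean; // STT input: mic capture plus MediaRecorder
  canSynthesize: boolean; // TTS output: a WebAssembly runtime
}

function detectVoiceCapabilities(env: VoiceEnv): VoiceCapabilities {
  const hasCapture = typeof env.mediaDevices?.getUserMedia === "function";
  const hasRecorder = typeof env.MediaRecorder === "function";
  return {
    canRecord: hasCapture && hasRecorder,
    canSynthesize: typeof env.WebAssembly === "object" && env.WebAssembly !== null,
  };
}
```

In the wizard this would be called with the real globals (`navigator.mediaDevices`, `globalThis.MediaRecorder`, `globalThis.WebAssembly`). It only reports API presence; whether a microphone is actually attached still needs the `enumerateDevices()` probe.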
### Anti-Patterns to Avoid
- **Importing piper-tts-web in Node.js:** The library explicitly does not support Node.js. It must only be imported in browser code (UI package). Vite will not include it in the server bundle.
- **Calling `tts.predict()` before downloading the model:** Results in a 15-30s silent hang. Always call `tts.download()` first (or check `tts.stored()`), show progress, then call `predict()`.
- **Registering `/transcribe` before auth middleware:** The existing `/transcribe` route calls `assertBoard(req)` — it must sit inside the `api` sub-router (after `boardMutationGuard`), not before it. The `chatFileRoutes` call belongs at line ~161 of `app.ts` alongside other `api.use()` calls.
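- **Passing the `predict()` result straight to `new Audio()`:** Per the package README, `tts.predict()` resolves to a WAV `Blob`, not a URL string or a Node `Buffer`. Create an object URL with `URL.createObjectURL(blob)` and pass that URL to `new Audio(url)`.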
---
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| TTS synthesis in browser | Custom ONNX loader + Piper WASM integration | `@mintplex-labs/piper-tts-web` | Already bundles ort-wasm, the phonemizer, and model management; the 492KB package handles all of it |
| Model download progress | Manual fetch with XHR progress | `tts.download(voiceId, progressCb)` | Built-in progress callback, automatic caching in browser storage (OPFS per the README) |
| PCM audio decoding for Whisper | Custom webm→PCM decoder | Keep CLI fallback in existing `/transcribe` route | No ffmpeg on host; smart-whisper requires PCM — adding conversion is out-of-scope for this phase |
| Mic permission detection | Custom navigator probe | `navigator.mediaDevices.enumerateDevices()` | Native browser API, no library needed |
**Key insight:** The browser handles TTS completely — no server-side Piper install needed for VOICE-01/02. Server-side Piper (VOICE-F01) is explicitly deferred.
---
## Common Pitfalls
### Pitfall 1: `/transcribe` Returns 404
**What goes wrong:** VoiceRecordButton sends audio to `/api/transcribe`, gets a 404, swallows the error silently. Voice input appears broken with no feedback to the user.
**Why it happens:** `chatFileRoutes` is exported in `routes/index.ts` but not imported or registered in `app.ts`. The route exists in code but is never mounted.
**How to avoid:** Add `import { chatFileRoutes } from "./routes/chat-files.js"` and `api.use(chatFileRoutes(db, opts.storageService))` in `app.ts`.
**Warning signs:** `POST /api/transcribe` returns 404; no logs from the route handler; the VoiceRecordButton spinner appears and disappears with no text inserted.
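One way to stop the failure from being silent is to map the response status to a message the button can display. The sketch below is illustrative only: `describeTranscribeStatus` and its `reason` values are hypothetical, and only the 404 (route not mounted) and 503 (no Whisper CLI) cases come from the behavior described in this document.

```typescript
// Hypothetical helper for VoiceRecordButton; not existing project code.
type TranscribeOutcome =
  | { ok: true }
  | { ok: false; reason: "route-not-mounted" | "whisper-missing" | "unknown"; message: string };

function describeTranscribeStatus(status: number): TranscribeOutcome {
  if (status >= 200 && status < 300) return { ok: true };
  if (status === 404) {
    // The route exists in code but was never mounted in app.ts.
    return {
      ok: false,
      reason: "route-not-mounted",
      message: "Voice input is not available on this server (transcribe route not found).",
    };
  }
  if (status === 503) {
    // Mirrors the route's existing 503 error text.
    return {
      ok: false,
      reason: "whisper-missing",
      message: "Whisper not available. Install whisper-cpp or openai-whisper for voice input.",
    };
  }
  return { ok: false, reason: "unknown", message: `Transcription failed (HTTP ${status}).` };
}
```

The button could call this with `response.status` after its `fetch` and render `message` whenever `ok` is false.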
### Pitfall 2: Piper TTS Silent Hang on First Use
**What goes wrong:** User clicks "speak" button. Nothing happens for 15-30 seconds, then audio plays. User thinks it's broken.
**Why it happens:** `tts.predict()` internally downloads the ONNX model (~10-50MB) on first call with no progress feedback.
**How to avoid:** Call `tts.download(voiceId, progressCb)` explicitly before the first `predict()`. Show a progress bar or spinner with percentage. The `prewarm()` pattern in the hook above is the canonical fix for VOICE-02.
**Warning signs:** The first TTS invocation is slow (10-30s) while subsequent calls are fast. After the first successful call, the cached model shows up under the site's storage in browser DevTools.
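For the visible progress element, the `loaded`/`total` byte counts from the download callback can be turned into a label. This is a small sketch with hypothetical helper names; it assumes the callback may occasionally report a zero or unknown total, in which case it omits the percentage.

```typescript
// Hypothetical formatting helpers for the pre-warm progress label.
function formatMegabytes(bytes: number): string {
  return (bytes / (1024 * 1024)).toFixed(1) + " MB";
}

function formatDownloadProgress(loaded: number, total: number): string {
  const prefix = `Downloading voice model: ${formatMegabytes(loaded)}`;
  // Unknown or zero total: show bytes received only, never "NaN%".
  if (!(total > 0)) return prefix;
  const pct = Math.min(100, Math.round((loaded / total) * 100));
  return `${prefix} of ${formatMegabytes(total)} (${pct}%)`;
}
```

Showing absolute sizes alongside the percentage helps on slow connections, where a 10-50 MB download can sit at the same percentage for several seconds.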
### Pitfall 3: `chatFileRoutes` Argument Mismatch
**What goes wrong:** Passing incorrect arguments to `chatFileRoutes(db, storage)` — e.g., passing the wrong storage interface type.
**Why it happens:** `app.ts` uses `opts.storageService` which is a `StorageService`. The function signature is `chatFileRoutes(db: Db, storage: StorageService)`.
**How to avoid:** Verify the StorageService import path and type. In `app.ts`, `opts.storageService` is already typed as `StorageService` and is used by other routes (e.g., `assetRoutes(db, opts.storageService)`). Mirror that pattern exactly.
### Pitfall 4: Onboarding Step Counter Mismatch
**What goes wrong:** Adding step 4 (voice) but forgetting to update Back/Continue `setStep()` calls in steps 5 and 6, causing step transitions to skip or loop.
**Why it happens:** `NexusOnboardingWizard.tsx` has hard-coded step numbers throughout (`setStep(4)`, `setStep(5)`, etc.) and a `Step` type union (`1 | 2 | 3 | 4 | 5`).
**How to avoid:** When inserting step 4, do a full audit of all `setStep(N)` calls and the `Step` type. The `type Step = 1 | 2 | 3 | 4 | 5` must become `1 | 2 | 3 | 4 | 5 | 6`. All old `setStep(4)` → `setStep(5)`, `setStep(5)` → `setStep(6)`. Rename the 5 → 6 calls first so that no call is bumped twice.
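An alternative that sidesteps the renumbering audit is to derive navigation from an ordered list of step ids. This is a sketch of that design, not how `NexusOnboardingWizard.tsx` is written today; the step ids and helper names are hypothetical.

```typescript
// Hypothetical refactor: the array order is the single source of truth.
const WIZARD_STEPS = ["hardware", "mode", "provider", "voice", "rootDir", "summary"] as const;
type WizardStep = (typeof WIZARD_STEPS)[number];

function nextStep(current: WizardStep): WizardStep {
  const i = WIZARD_STEPS.indexOf(current);
  return WIZARD_STEPS[Math.min(i + 1, WIZARD_STEPS.length - 1)]; // clamp at the last step
}

function prevStep(current: WizardStep): WizardStep {
  const i = WIZARD_STEPS.indexOf(current);
  return WIZARD_STEPS[Math.max(i - 1, 0)]; // clamp at the first step
}

function stepLabel(current: WizardStep): string {
  return `Step ${WIZARD_STEPS.indexOf(current) + 1} of ${WIZARD_STEPS.length}`;
}
```

With this shape, inserting a future step is a one-element change to `WIZARD_STEPS`, and the Back/Continue handlers and the "Step X of N" label follow automatically.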
### Pitfall 5: piper-tts-web In a Web Worker Context
**What goes wrong:** Importing `@mintplex-labs/piper-tts-web` fails in a Web Worker because of missing `window` or `document` globals.
**Why it happens:** The library expects browser globals. It also mentions supporting Web Worker patterns but requires careful WASM path configuration.
**How to avoid:** Use the library from a regular React component/hook (main thread). Do not import in server-side code, Node.js workers, or Vitest Node environment tests. Mark test files importing it with `@vitest-environment jsdom` if needed, or mock the module in tests.
---
## Code Examples
### Fix: Register chatFileRoutes in app.ts
```typescript
// Source: server/src/app.ts (existing pattern — mirror assetRoutes)
// Near top of file with other route imports:
import { chatFileRoutes } from "./routes/chat-files.js";
// Inside createApp(), after api.use(assistantHandoffRoutes(db)):
api.use(chatFileRoutes(db, opts.storageService));
```
Confirmed pattern from `app.ts` line 147:
```typescript
api.use(assetRoutes(db, opts.storageService));
```
### TTS: Minimal predict() call
```typescript
// Source: @mintplex-labs/piper-tts-web README
import * as tts from "@mintplex-labs/piper-tts-web";

// Download model with progress (pre-warm):
await tts.download("en_US-hfc_female-medium", (progress) => {
  const pct = Math.round((progress.loaded / progress.total) * 100);
  console.log(`Downloading voice model: ${pct}%`);
});

// Synthesize (resolves to a WAV Blob):
const wav = await tts.predict({
  text: "Hello, I am your assistant.",
  voiceId: "en_US-hfc_female-medium",
});

// Play via an object URL:
const audio = new Audio(URL.createObjectURL(wav));
audio.play();
```
### TTS: Check if already downloaded (skip re-download)
```typescript
// Source: @mintplex-labs/piper-tts-web README
const stored = await tts.stored(); // string[] of cached voiceIds
if (!stored.includes("en_US-hfc_female-medium")) {
  await tts.download("en_US-hfc_female-medium", progressCb);
}
```
### Mic availability probe (no library required)
```typescript
// Source: MDN Web API (browser standard)
async function hasMicrophone(): Promise<boolean> {
  try {
    const devices = await navigator.mediaDevices.enumerateDevices();
    return devices.some((d) => d.kind === "audioinput");
  } catch {
    return false;
  }
}
```
---
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| whisper CLI (openai-whisper Python) | smart-whisper Node.js native binding | 2023-2024 | No Python runtime needed; better perf |
| Piper CLI binary | @mintplex-labs/piper-tts-web WASM | 2024 | Runs in browser, no server setup |
| Server-rendered TTS audio | Client-side WASM synthesis | 2024 | Eliminates network round-trip; offline-safe |
**Deprecated/outdated:**
- `whisper-cpp` CLI: still works but requires system-level install; the existing `/transcribe` route already has this fallback — adequate for now
- `rhasspy/piper` repository: archived Oct 2025, development moved to `OHF-Voice/piper1-gpl`; the `@mintplex-labs/piper-tts-web` npm package uses the original archived models (MIT) and still works
---
## Open Questions
1. **nexus-settings voice persistence**
- What we know: `nexus-settings.json` currently only stores `{ mode }`. The `nexusSettingsSchema` is a Zod schema.
- What's unclear: Should `voiceEnabled: boolean` be added to the schema? The constraint says "no DB schema changes" but this is a file-backed JSON, not a DB table.
- Recommendation: Add `voiceEnabled: z.boolean().default(false)` to `nexusSettingsSchema`. This is a file field, not a DB migration. The planner should confirm this is acceptable under the "no DB schema changes" constraint.
2. **smart-whisper Apple Silicon unverified claim (from STATE.md blockers)**
- What we know: STATE.md notes "smart-whisper Apple Silicon acceleration claim unverified on Mac Mini M4 — fall back to `tiny.en` if `base.en` acceleration not confirmed on device."
- What's unclear: Whether Metal acceleration actually works for `base.en` on M4.
- Recommendation: The current `/transcribe` route uses CLI fallback anyway. Since this phase is NOT rewriting STT with smart-whisper (just fixing route registration), this blocker does not apply to Phase 34.
3. **VoiceRecordButton in PersonalAssistant**
- What we know: `ChatPanel` sets `enableVoiceInput={true}`. `PersonalAssistant.tsx` does not use `ChatInput` and has its own send form that does NOT include a `VoiceRecordButton`. Voice input only works in the project-mode `ChatPanel`, not in the personal assistant chat.
- What's unclear: Whether VOICE-01/02/03 require voice in personal assistant chat specifically.
- Recommendation: Planner should add `VoiceRecordButton` to `PersonalAssistant.tsx`'s input area as part of this phase, since personal assistant is the primary chat surface for v1.5.
---
## Environment Availability
| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| Node.js | Server runtime | Yes | v20.20.2 | — |
| piper CLI | VOICE-01 (server-side) | No | — | Browser WASM via piper-tts-web (preferred) |
| whisper CLI | /transcribe route | No | — | Route returns 503 with user-visible error |
| whisper-cpp CLI | /transcribe route | No | — | Falls through to openai-whisper, then 503 |
| ffmpeg | WebM→PCM conversion | No | — | Keep CLI-fallback STT; no smart-whisper upgrade this phase |
| Browser MediaRecorder | VoiceRecordButton | N/A (browser) | — | Degrades gracefully (mic unavailable state) |
**Missing dependencies with no fallback:**
- None that block this phase — the `/transcribe` route already handles missing Whisper CLIs gracefully with a 503 + descriptive error. Piper TTS runs entirely in browser WASM, no server dep.
**Missing dependencies with fallback:**
- `whisper` / `whisper-cpp`: Not installed. Route returns `{ error: "Whisper not available. Install whisper-cpp or openai-whisper for voice input." }` with 503. This is existing behavior. STT stays unavailable until the user installs Whisper, which is acceptable as long as the UI surfaces the 503 message that tells them what to install.
---
## Validation Architecture
### Test Framework
| Property | Value |
|----------|-------|
| Framework | Vitest 3.0.5 |
| Config file | `server/vitest.config.ts` |
| Quick run command | `npx vitest run server/src/__tests__/34-voice-routes.test.ts` |
| Full suite command | `npx vitest run` (from `/opt/nexus`) |
### Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| VOICE-01 | `chatFileRoutes` is mounted in `app.ts` — GET/POST routes are reachable | unit (route) | `npx vitest run server/src/__tests__/34-voice-routes.test.ts` | No — Wave 0 |
| VOICE-02 | `usePiperTts` hook exposes `prewarm()`, `status`, `progress` | unit (hook) | `npx vitest run ui/src/hooks/usePiperTts.test.ts` | No — Wave 0 |
| VOICE-03 | `NexusOnboardingWizard` renders voice step at step 4 | unit (component) | Manual / `npx vitest run ui/src/components/NexusOnboardingWizard.test.ts` | No — Wave 0 |
### Sampling Rate
- **Per task commit:** `npx vitest run server/src/__tests__/34-voice-routes.test.ts`
- **Per wave merge:** `npx vitest run`
- **Phase gate:** Full suite green before `/gsd:verify-work`
### Wave 0 Gaps
- [ ] `server/src/__tests__/34-voice-routes.test.ts` — covers VOICE-01 (route registration, 503 when no whisper CLI)
- [ ] `ui/src/hooks/usePiperTts.test.ts` — covers VOICE-02 hook state machine (mock piper-tts-web)
- [ ] `ui/src/components/onboarding/VoiceStep.test.tsx` — covers VOICE-03 step rendering
---
## Sources
### Primary (HIGH confidence)
- Codebase inspection — `server/src/routes/chat-files.ts`, `server/src/app.ts`, `ui/src/components/VoiceRecordButton.tsx`, `ui/src/components/NexusOnboardingWizard.tsx`
- npm registry — `@mintplex-labs/piper-tts-web@1.0.4`, `smart-whisper@0.8.1` (verified 2026-04-01)
### Secondary (MEDIUM confidence)
- [Mintplex-Labs/piper-tts-web README](https://github.com/Mintplex-Labs/piper-tts-web/blob/main/README.md) — `tts.download()`, `tts.predict()`, `tts.stored()` API
- [JacobLinCool/smart-whisper GitHub](https://github.com/JacobLinCool/smart-whisper) — Whisper class, PCM Float32Array requirement, Metal on Apple Silicon
- [smart-whisper documentation](https://jacoblincool.github.io/smart-whisper/) — transcribe API, model manager
### Tertiary (LOW confidence)
- WebSearch results for Piper TTS Node.js integration — browser-only WASM pattern confirmed by multiple sources
---
## Metadata
**Confidence breakdown:**
- Standard stack: MEDIUM — npm package versions verified; API verified via README; no local test environment to run the library
- Architecture: HIGH — based on direct codebase inspection (route registration gap confirmed, wizard step structure confirmed)
- Pitfalls: HIGH — route registration gap is a confirmed code-level fact, not speculation
**Research date:** 2026-04-01
**Valid until:** 2026-05-01 (stable libraries; piper-tts-web and smart-whisper are low-churn)