docs(37): phase research — VAD, COOP/COEP, component architecture

Nexus Dev 2026-04-04 02:07:19 +00:00
parent f2f381a3a2
commit fdc956c6a6


@ -0,0 +1,567 @@
# Phase 37: Web Chat Voice UI - Research
**Researched:** 2026-04-03
**Domain:** Browser voice I/O — VAD, MediaRecorder, Web Audio API, waveform visualization, audio playback, COOP/COEP headers
**Confidence:** HIGH
---
<user_constraints>
## User Constraints (from CONTEXT.md)
### Locked Decisions
All implementation choices are at Claude's discretion — discuss phase was skipped per user setting.
### Claude's Discretion
All implementation details. Use ROADMAP phase goal, success criteria, and codebase conventions.
Key research findings baked into context:
- `@ricky0123/vad-react ^0.0.36` for browser-side silence detection (VAD)
- COOP/COEP headers required on Express server for SharedArrayBuffer
- Waveform via Web Audio API AnalyserNode (Canvas or SVG, 30-50 data points)
- Native `<audio>` element + URL.createObjectURL() for playback
- Three-state voice mode: "text" | "voice_input" | "full_voice"
- VoiceMicButton replaces/enhances existing VoiceRecordButton
- Voice badge + expandable markdown section in ChatMessage
### Deferred Ideas (OUT OF SCOPE)
None — discuss phase skipped.
</user_constraints>
<phase_requirements>
## Phase Requirements
| ID | Description | Research Support |
|----|-------------|------------------|
| WCHAT-01 | Mic button in chat input starts/stops voice recording with visual state (idle/recording/processing) | VoiceMicButton replaces VoiceRecordButton; the three visual states are derived from listening/userSpeaking/loading returned by useMicVAD |
| WCHAT-02 | Recording auto-stops on silence detection via VAD | useMicVAD onSpeechEnd callback fires automatically once the configured silence window elapses (target: 1.5s); no manual stop needed |
| WCHAT-03 | Real-time waveform/amplitude visualization displays while recording | VoiceWaveform canvas component using Web Audio API AnalyserNode + requestAnimationFrame |
| WCHAT-04 | Voice response audio plays inline in chat message with audio player controls | ChatVoicePlayer with native `<audio>` + URL.createObjectURL(); POST /api/synthesize → blob |
| WCHAT-05 | User can toggle voice mode: text only / voice input only / full voice (input + output) | VoiceModeToggle three-pill component; persists to nexus-settings voiceMode field |
| WCHAT-06 | Auto-play of voice responses is configurable (on/off in settings) | autoPlay flag in nexus-settings or localStorage; ChatVoicePlayer reads it on mount |
</phase_requirements>
---
## Summary
Phase 37 adds browser-based voice I/O to the existing web chat. Phase 36 delivered the server-side pipeline (VoicePipelineService, POST /api/transcribe, POST /api/synthesize, voiceMode wiring in chat.ts) and the nexus-settings schema extension. Phase 37 is a frontend phase with one server-side addition: COOP/COEP response headers on the Express static middleware.
The central library is `@ricky0123/vad-react ^0.0.36`, which wraps Silero VAD running in an AudioWorklet. It requires the page to be cross-origin isolated (COOP + COEP headers) to use SharedArrayBuffer. The package ships ONNX model files and a worklet bundle that must either be served locally from `public/` or loaded from its default CDN URLs. The CDN default is simpler, but once COEP `require-corp` is active, cross-origin assets load only if the CDN sends suitable CORP/CORS headers, so this phase serves them locally.
Waveform visualization uses a standard Web Audio API AnalyserNode pattern: connect the microphone stream → AnalyserNode → read Uint8Array in requestAnimationFrame loop → render bars on a `<canvas>`. This is entirely in-browser with no extra library. Audio playback for synthesized responses uses the native `<audio>` HTML element with `URL.createObjectURL()` from a Blob received from POST /api/synthesize.
**Primary recommendation:** Install @ricky0123/vad-react, add COOP/COEP headers to Express static/vite-dev middleware, serve VAD assets from `ui/public/`, build five new components + two hooks as specified in 37-UI-SPEC.md, extend ChatInput + ChatMessage, wire voiceMode through useStreamingChat.
---
## Branch Context (Critical)
The current worktree branch (`gsd/phase-35-npx-buildthis-cli`) has only Phase 36 Task 1 committed (VoicePipelineService). The remaining Phase 36 deliverables live on a separate branch not yet merged:
| Phase 36 Deliverable | Git Commit | Status in Current Branch |
|---|---|---|
| VoicePipelineService | `0ed912c2` | **PRESENT** |
| nexus-settings voiceMode schema | `d0d7a23a` | **ABSENT** — must be built in 37 Wave 0 or assumed present |
| voiceMode in createMessageSchema | `b964c0e4` | **ABSENT** |
| POST /api/transcribe, POST /api/synthesize routes | `11508547` | **ABSENT** |
| voiceMode wiring in chat.ts stream route | `fd372eaf` | **ABSENT** |
**Implication for planning:** Wave 0 of Phase 37 must either (a) merge/cherry-pick the Phase 36 remainder, or (b) re-implement the four absent deliverables before building Phase 37 UI. The plan should treat Phase 36 tasks 2-3 as Wave 0 prerequisites and verify them before proceeding.
The ChatInput.tsx, ChatMessage.tsx, VoiceRecordButton.tsx, and related UI components exist on the parent branch `PAP-878-create-a-mine-tab-in-inbox` but NOT in the current worktree. The plan must account for these being the integration targets.
---
## Standard Stack
### Core
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| `@ricky0123/vad-react` | `^0.0.36` | Browser VAD with 1.5s silence auto-stop | Specified in 37-UI-SPEC.md; only mature browser-side VAD library |
| `@ricky0123/vad-web` | `0.0.30` (peer) | VAD engine (AudioWorklet + Silero ONNX) | Peer dep of vad-react |
| `onnxruntime-web` | `^1.17.0` (peer) | ONNX runtime for Silero model | Required by vad-web |
| Web Audio API | browser built-in | AnalyserNode for waveform bars | Zero bundle cost; already in browser |
| Native `<audio>` | browser built-in | Playback of synthesized WAV | No extra library needed |
### Supporting
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| lucide-react | `^0.574.0` (already in ui/) | `Mic`, `Square`, `Loader2`, `Volume2`, `Play`, `Pause` icons | Voice button states + audio player |
| shadcn/ui `Badge` | already installed | Voice badge on agent messages | ChatVoiceBadge component |
| shadcn/ui `Collapsible` | already installed | Expand/collapse full markdown in voice_full messages | ChatVoiceBadge expand section |
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| @ricky0123/vad-react | Manual silence detection with AudioWorklet | Much more complex; vad-react is the de facto standard |
| Canvas waveform | SVG bars | Canvas performs better for 30fps animation |
| Native `<audio>` + blob URL | Howler.js | No extra dependency; native handles WAV fine |
**Installation:**
```bash
pnpm add @ricky0123/vad-react --filter @paperclipai/ui
```
**Version verification (confirmed against npm registry 2026-04-03):**
- `@ricky0123/vad-react`: 0.0.36 (latest)
- `@ricky0123/vad-web`: 0.0.30 (peer dependency, installed automatically)
- `onnxruntime-web`: 1.24.3 (latest; ^1.17.0 from vad-web is satisfied)
---
## Architecture Patterns
### Recommended Project Structure
```
ui/src/
├── components/
│   ├── VoiceMicButton.tsx    # Replaces VoiceRecordButton: VAD + waveform + three states
│   ├── VoiceWaveform.tsx     # Canvas amplitude bars (30-50 points, 32px tall)
│   ├── VoiceModeToggle.tsx   # Three-pill: Text / Voice In / Full Voice
│   ├── ChatVoicePlayer.tsx   # Inline audio player with play/pause/progress
│   └── ChatVoiceBadge.tsx    # "Voice" badge + collapsible full markdown
└── hooks/
    ├── useVadRecorder.ts     # Wraps useMicVAD; exposes Float32Array on speech end
    └── useVoiceMode.ts       # Reads/writes voiceMode from nexus-settings
ui/public/
├── vad.worklet.bundle.min.js # From @ricky0123/vad-web/dist/
├── silero_vad_legacy.onnx    # From @ricky0123/vad-web/dist/
└── silero_vad_v5.onnx        # From @ricky0123/vad-web/dist/
server/src/
└── app.ts                    # Add COOP/COEP headers middleware
```
### Pattern 1: useMicVAD from @ricky0123/vad-react
**What:** Hook that runs Silero VAD in an AudioWorklet; fires `onSpeechEnd(audio: Float32Array)` after silence
**When to use:** VoiceMicButton and useVadRecorder hook
```typescript
// Source: https://docs.vad.ricky0123.com/user-guide/api/
import { useMicVAD } from "@ricky0123/vad-react";

const vad = useMicVAD({
  startOnLoad: false, // user must click mic button first
  onSpeechEnd: (audio: Float32Array) => {
    // audio is Float32Array at 16kHz
    // Convert to WAV blob and POST to /api/transcribe
  },
  onSpeechStart: () => { /* update waveform active state */ },
  positiveSpeechThreshold: 0.8,
  negativeSpeechThreshold: 0.8 - 0.15,
  // Silence frames tolerated before speech end. Duration = frames x frame length,
  // which depends on the model; tune this to reach the 1.5s target in WCHAT-02.
  redemptionFrames: 8,
  baseAssetPath: "/", // serve from ui/public/
  onnxWASMBasePath: "/",
});
// Returned: { listening, loading, errored, userSpeaking, start, pause }
```
**Audio conversion for upload:**
```typescript
// Float32Array → WAV blob (16kHz, mono, 16-bit PCM)
function float32ToWav(samples: Float32Array, sampleRate = 16000): Blob {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  // WAV header...
  return new Blob([buffer], { type: "audio/wav" });
}
```
### Pattern 2: Web Audio API AnalyserNode for waveform
**What:** Connect MediaStream to AnalyserNode; poll getByteFrequencyData in rAF loop
**When to use:** VoiceWaveform component (only while recording)
```typescript
// Source: MDN Web Audio API docs
const audioCtx = new AudioContext();
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 64; // 32 frequency bins
const source = audioCtx.createMediaStreamSource(stream);
source.connect(analyser);
const dataArray = new Uint8Array(analyser.frequencyBinCount); // 32 bars

function draw() {
  // animRef is a React ref holding the frame id so the loop can be cancelled on unmount
  animRef.current = requestAnimationFrame(draw);
  analyser.getByteFrequencyData(dataArray);
  // render bars to canvas
}
```
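The "render bars to canvas" step needs the raw frequency bins reduced to a fixed number of bar heights. The following is a minimal sketch of that reduction, assuming the 32px bar height from the project structure; `barsFromFrequencyData` and its parameters are hypothetical names, not part of the codebase or the Web Audio API.

```typescript
// Hypothetical helper: averages neighbouring frequency bins into `barCount`
// bars and scales each average (0-255) to a pixel height (0-maxHeight).
function barsFromFrequencyData(
  data: Uint8Array,
  barCount: number,
  maxHeight: number,
): number[] {
  const bars: number[] = [];
  if (barCount <= 0 || data.length === 0) return bars;
  const binsPerBar = data.length / barCount;
  for (let i = 0; i < barCount; i++) {
    const start = Math.floor(i * binsPerBar);
    // Always cover at least one bin, even when there are more bars than bins
    const end = Math.max(start + 1, Math.floor((i + 1) * binsPerBar));
    let sum = 0;
    let count = 0;
    for (let j = start; j < end && j < data.length; j++) {
      sum += data[j] ?? 0;
      count++;
    }
    const average = count > 0 ? sum / count : 0;
    bars.push(Math.round((average / 255) * maxHeight));
  }
  return bars;
}
```

Keeping this step a pure function means it can be unit-tested without a canvas or an AudioContext; the `draw()` loop would only call it and paint the result.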
### Pattern 3: COOP/COEP headers in Express
**What:** Cross-origin isolation required for SharedArrayBuffer (used by AudioWorklet/ONNX)
**When to use:** All static responses and Vite dev server
```typescript
// Source: MDN - Cross-Origin Isolation
// In server/src/app.ts, before static/vite middleware:
app.use((_req, res, next) => {
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  next();
});

// For Vite dev in vite.config.ts (fragment of the config object passed to defineConfig):
server: {
  headers: {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
  },
},
```
**Critical:** COEP `require-corp` means every cross-origin resource must opt in with a CORP (or CORS) header. That applies to CDN-hosted VAD assets as well as externally hosted images, because the worklet script and ONNX model are fetched like any other sub-resource. Serve VAD assets from `ui/public/` (same-origin) to avoid CORP issues entirely.
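Whether the headers actually took effect can be checked at runtime through the standard `crossOriginIsolated` global. The sketch below is one way to gate the voice UI on it; `canUseSharedMemory` and the `IsolationGlobals` interface are hypothetical names introduced here so the check can be exercised outside a browser.

```typescript
// Minimal view of the globals this check reads (hypothetical interface).
interface IsolationGlobals {
  crossOriginIsolated?: boolean;
  SharedArrayBuffer?: unknown;
}

// Returns true only when the page is cross-origin isolated AND
// SharedArrayBuffer is exposed, i.e. the COOP/COEP headers were honoured.
function canUseSharedMemory(
  g: IsolationGlobals = globalThis as unknown as IsolationGlobals,
): boolean {
  return g.crossOriginIsolated === true && typeof g.SharedArrayBuffer !== "undefined";
}
```

A component could call `canUseSharedMemory()` before starting VAD and fall back to a clear error state when it returns false, rather than waiting on a model that may never finish loading.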
### Pattern 4: VAD asset setup (Vite)
**What:** Copy ONNX + worklet files to public/ so they are served at root
**When to use:** Build setup task
```bash
# After pnpm install, copy from node_modules:
cp node_modules/@ricky0123/vad-web/dist/vad.worklet.bundle.min.js ui/public/
cp node_modules/@ricky0123/vad-web/dist/silero_vad_legacy.onnx ui/public/
cp node_modules/@ricky0123/vad-web/dist/silero_vad_v5.onnx ui/public/
```
Alternatively, use `vite-plugin-static-copy`, or add a copy script to `package.json` and run it from `prepare`:
```json
"scripts": {
"copy-vad-assets": "cp node_modules/@ricky0123/vad-web/dist/vad.worklet.bundle.min.js public/ && cp node_modules/@ricky0123/vad-web/dist/*.onnx public/"
}
```
### Pattern 5: useVoiceMode hook
**What:** Reads voiceMode from GET /api/nexus-settings, writes via PATCH
**When to use:** VoiceModeToggle component; ChatPanel to pass voiceMode to stream call
```typescript
// Source: existing nexus-settings pattern in codebase
import { useState } from "react";

type VoiceMode = "text" | "voice_input" | "full_voice";

export function useVoiceMode() {
  const [mode, setMode] = useState<VoiceMode>("text");
  // Load on mount via GET /api/nexus-settings
  // PATCH on change
  return { mode, setMode: async (next: VoiceMode) => { ... } };
}
```
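A settings response arrives untyped, so the hook likely needs to validate the stored value before trusting it. This is a small sketch of such a guard, assuming that falling back to `"text"` is the desired behaviour for missing or unknown values; `parseVoiceMode` is a hypothetical helper, not an existing function in the codebase.

```typescript
type VoiceMode = "text" | "voice_input" | "full_voice";

const VOICE_MODES: readonly VoiceMode[] = ["text", "voice_input", "full_voice"];

// Hypothetical guard: narrows an unknown settings value to VoiceMode,
// defaulting to "text" when the value is absent or not recognised.
function parseVoiceMode(value: unknown): VoiceMode {
  const match = VOICE_MODES.find((mode) => mode === value);
  return match ?? "text";
}
```

For example, `parseVoiceMode(settings.voiceMode)` would keep the toggle in a valid state even if an older settings document has no `voiceMode` field.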
### Pattern 6: Float32Array → WAV Blob
**What:** Convert vad-react onSpeechEnd Float32Array (16kHz) to WAV for upload
**When to use:** useVadRecorder.ts, before POSTing to /api/transcribe
```typescript
// Source: standard WAV encoding algorithm (verified against multiple sources)
// Writes an ASCII tag (e.g. "RIFF") byte by byte at the given offset.
function writeString(view: DataView, offset: number, text: string): void {
  for (let i = 0; i < text.length; i++) {
    view.setUint8(offset + i, text.charCodeAt(i));
  }
}

function encodeWav(samples: Float32Array, sampleRate = 16000): Blob {
  const numSamples = samples.length;
  const buffer = new ArrayBuffer(44 + numSamples * 2);
  const view = new DataView(buffer);
  // RIFF chunk
  writeString(view, 0, "RIFF");
  view.setUint32(4, 36 + numSamples * 2, true); // file size minus 8 bytes
  writeString(view, 8, "WAVE");
  // fmt sub-chunk
  writeString(view, 12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size (16 bytes for PCM)
  view.setUint16(20, 1, true); // audio format: PCM = 1
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate = sampleRate * block align
  view.setUint16(32, 2, true); // block align = channels * bytes per sample
  view.setUint16(34, 16, true); // bits per sample
  // data sub-chunk
  writeString(view, 36, "data");
  view.setUint32(40, numSamples * 2, true);
  // Clamp each sample to [-1, 1] and scale to signed 16-bit little-endian
  let offset = 44;
  for (let i = 0; i < numSamples; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
    offset += 2;
  }
  return new Blob([buffer], { type: "audio/wav" });
}
```
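The header layout above can be checked by reading the same offsets back. The sketch below is a hypothetical `readWavHeader` helper (not part of the codebase) that could back the planned "valid 44-byte WAV header" unit test; it assumes the canonical layout with the `data` tag at offset 36 and no extra chunks.

```typescript
interface WavHeader {
  audioFormat: number;
  channels: number;
  sampleRate: number;
  byteRate: number;
  blockAlign: number;
  bitsPerSample: number;
  dataBytes: number;
}

// Hypothetical helper: parses the canonical 44-byte PCM WAV header and
// throws if the buffer is too short or any chunk tag is missing.
function readWavHeader(buffer: ArrayBuffer): WavHeader {
  if (buffer.byteLength < 44) throw new Error("buffer shorter than 44-byte WAV header");
  const view = new DataView(buffer);
  const tag = (offset: number): string =>
    String.fromCharCode(
      view.getUint8(offset),
      view.getUint8(offset + 1),
      view.getUint8(offset + 2),
      view.getUint8(offset + 3),
    );
  if (tag(0) !== "RIFF" || tag(8) !== "WAVE" || tag(12) !== "fmt " || tag(36) !== "data") {
    throw new Error("not a canonical PCM WAV header");
  }
  return {
    audioFormat: view.getUint16(20, true),
    channels: view.getUint16(22, true),
    sampleRate: view.getUint32(24, true),
    byteRate: view.getUint32(28, true),
    blockAlign: view.getUint16(32, true),
    bitsPerSample: view.getUint16(34, true),
    dataBytes: view.getUint32(40, true),
  };
}
```

In a test, the Blob from the encoder would be converted with `await blob.arrayBuffer()` and the parsed fields compared against 16000 Hz, mono, 16-bit.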
### Pattern 7: POST /api/synthesize + playback
**What:** Send text to synthesis endpoint, receive WAV buffer, play with native audio
**When to use:** ChatVoicePlayer when messageType is voice_full
```typescript
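async function playVoiceResponse(text: string, autoPlay: boolean) {
  const res = await fetch("/api/synthesize", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    credentials: "include",
    body: JSON.stringify({ text }),
  });
  if (!res.ok) throw new Error(`Synthesis failed: ${res.status}`);
  const blob = await res.blob();
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  // Revoke the blob URL once playback ends to avoid leaking memory
  audio.addEventListener("ended", () => URL.revokeObjectURL(url));
  if (autoPlay) {
    // play() returns a promise that rejects if the browser blocks autoplay
    await audio.play().catch(() => { /* stay paused; the user can press play */ });
  }
  // Return the element so the caller can expose pause/play controls
  return audio;
}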
```
### Anti-Patterns to Avoid
- **Calling useMicVAD with startOnLoad: true:** Triggers immediate mic permission prompt on page load, not on user gesture. Always use `startOnLoad: false` and call `vad.start()` on mic button click.
- **Using AudioContext before user gesture:** Browsers require AudioContext creation/resume inside a user interaction. Create it lazily in the click handler, not on component mount.
- **Serving VAD assets from CDN with COEP require-corp:** CDN resources lack CORP headers. Will cause COEP fetch errors. Always copy to `ui/public/` and use `baseAssetPath: "/"`.
- **Not revoking blob URLs:** `URL.createObjectURL()` leaks memory if URLs are not revoked after use.
- **POSTing Float32Array directly to /api/transcribe:** The transcribe endpoint expects `audio/webm` or `audio/wav` multipart upload. Must encode Float32Array to WAV first.
---
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Silence detection | Custom silence timer with AudioWorklet | @ricky0123/vad-react | Silero VAD model; handles background noise, breath, plosives; 37 published versions |
| WAV encoding | Custom encoder | Short standard WAV encoder with a 44-byte header (see Pattern 6) | Not complex enough for a library; standard algorithm |
| Audio playback | Custom audio element abstraction | Native `<audio>` + URL.createObjectURL() | Browser handles all codec/format negotiation |
| ONNX inference | Build ONNX runner | onnxruntime-web (peer dep of vad-web) | Already bundled |
---
## Common Pitfalls
### Pitfall 1: COEP blocks CDN asset loading
**What goes wrong:** After adding `Cross-Origin-Embedder-Policy: require-corp`, all cross-origin resources (Google Fonts, avatars from external URLs, CDN assets) are blocked unless they send `Cross-Origin-Resource-Policy: cross-origin`. Existing chat images from `/api/assets/` (same-origin) are fine, but any externally hosted content breaks.
**Why it happens:** COEP `require-corp` enforces CORP on all sub-resources.
**How to avoid:** Serve all VAD ONNX/worklet assets from `ui/public/` (same-origin). Audit for any cross-origin resource loads in existing chat components before adding headers.
**Warning signs:** Console errors: "COEP blocked cross-origin resource" for non-audio assets.
### Pitfall 2: AudioContext suspended due to autoplay policy
**What goes wrong:** `AudioContext.state === "suspended"` prevents AnalyserNode from producing data; waveform is all zeros.
**Why it happens:** Browsers require AudioContext to be created or resumed inside a user gesture (click/tap).
**How to avoid:** Create `new AudioContext()` lazily inside the mic button click handler. If the context exists but is suspended, call `context.resume()` before starting recording.
**Warning signs:** Waveform canvas renders but all bars are flat (zero amplitude).
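The resume-before-recording step can be isolated into a small helper. This is a sketch only: `ensureRunning` and the `ResumableContext` interface are hypothetical names, and the interface lists just the two `AudioContext` members the helper relies on (`state` and `resume()`), which lets it be tested with a fake.

```typescript
// The subset of AudioContext this helper needs (hypothetical interface).
interface ResumableContext {
  readonly state: string;
  resume(): Promise<void>;
}

// Resumes a suspended context and reports whether it is running afterwards.
// Intended to be awaited inside the mic button click handler.
async function ensureRunning(ctx: ResumableContext): Promise<boolean> {
  if (ctx.state === "suspended") {
    await ctx.resume();
  }
  return ctx.state === "running";
}
```

If it resolves to `false`, the button can stay idle and surface an error instead of showing a flat waveform.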
### Pitfall 3: VAD model files not found
**What goes wrong:** `useMicVAD` throws or hangs with `loading: true` indefinitely; console shows 404 for `.onnx` or `.worklet.bundle.min.js`.
**Why it happens:** Default `baseAssetPath` may point to CDN; if COEP is active, CDN fetch is blocked. Or files were not copied to `ui/public/`.
**How to avoid:** Explicitly set `baseAssetPath: "/"` and `onnxWASMBasePath: "/"` in useMicVAD options. Verify files exist at `ui/public/vad.worklet.bundle.min.js`, `ui/public/silero_vad_legacy.onnx`, `ui/public/silero_vad_v5.onnx` after install.
**Warning signs:** `vad.loading === true` for more than 3 seconds; 404s in network tab.
### Pitfall 4: voiceMode not passed through useStreamingChat
**What goes wrong:** When a voice_input message is sent, the server doesn't set `messageType: "voice_input"` on the stored message, so ChatVoiceBadge never renders.
**Why it happens:** `useStreamingChat.startStream()` current signature is `(userMessage: string, agentId?: string)` — no `voiceMode` parameter. Chat.ts only sets messageType when voiceMode is in the request body.
**How to avoid:** Extend `useStreamingChat.startStream()` to accept `voiceMode?: string` and pass it in the fetch body to `/api/conversations/${id}/stream`.
**Warning signs:** Voice messages render as plain user text without the voice badge.
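One way to make the extension hard to forget is to build the request body in a single function. The sketch below is illustrative only: `buildStreamBody` is a hypothetical helper, and the `message` and `agentId` field names are assumptions about the existing request shape; only `voiceMode` in the body is stated above.

```typescript
type VoiceMode = "text" | "voice_input" | "full_voice";

// Assumed request shape; `message` and `agentId` are placeholder field names.
interface StreamRequestBody {
  message: string;
  agentId?: string;
  voiceMode?: VoiceMode;
}

// Builds the stream request body, adding optional fields only when provided
// so existing text-only callers send exactly what they sent before.
function buildStreamBody(
  userMessage: string,
  agentId?: string,
  voiceMode?: VoiceMode,
): StreamRequestBody {
  const body: StreamRequestBody = { message: userMessage };
  if (agentId !== undefined) body.agentId = agentId;
  if (voiceMode !== undefined) body.voiceMode = voiceMode;
  return body;
}
```

`startStream()` would then pass `JSON.stringify(buildStreamBody(userMessage, agentId, voiceMode))` as the fetch body.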
### Pitfall 5: onSpeechEnd fires on very short utterances
**What goes wrong:** Background noise triggers `onSpeechEnd` with very short audio that produces garbage transcription.
**Why it happens:** VAD fires even for brief sounds if `positiveSpeechThreshold` is too low.
**How to avoid:** Set `minSpeechFrames: 3` (minimum ~180ms) and `positiveSpeechThreshold: 0.8` to filter noise. Display a "Too short" toast if the returned text is empty or < 2 chars.
**Warning signs:** Empty transcriptions appearing in chat; fast repeated submissions.
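Frame-based options such as `minSpeechFrames` and `redemptionFrames` only translate to milliseconds once the frame length is known. The sketch below shows the arithmetic; `framesForDuration` is a hypothetical helper, and the frame sizes in the usage note (1536 samples for the legacy model, 512 for v5, both at 16kHz) are assumptions that should be confirmed against the installed vad-web version.

```typescript
// Hypothetical helper: number of whole frames needed to cover `targetMs`,
// given the samples per frame and the sample rate. Computed as a single
// ratio to avoid floating-point drift in an intermediate "ms per frame" value.
function framesForDuration(
  targetMs: number,
  frameSamples: number,
  sampleRate = 16000,
): number {
  return Math.ceil((targetMs * sampleRate) / (frameSamples * 1000));
}
```

Under those assumed frame sizes, a 1.5s silence window would need `framesForDuration(1500, 1536)` frames with the legacy model and `framesForDuration(1500, 512)` with v5, so the same frame count means different durations depending on which model is loaded.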
### Pitfall 6: Phase 36 deliverables not present in working branch
**What goes wrong:** Building ChatInput voice integration before `server/src/routes/voice.ts`, `server/src/services/nexus-settings.ts` voiceMode schema, and `voiceMode` in createMessageSchema are present causes compile errors and missing endpoints.
**Why it happens:** Only Phase 36 Task 1 (VoicePipelineService) is on the current branch.
**How to avoid:** Wave 0 must cherry-pick or re-implement Phase 36 Tasks 2-3 commits before any Phase 37 implementation work. Verify that `POST /api/transcribe` and `POST /api/synthesize` are registered (a request does not return 404) before proceeding.
---
## Code Examples
### VoiceMicButton state machine
```typescript
// Source: 37-UI-SPEC.md + useMicVAD API docs
import { useState } from "react";
import { useMicVAD } from "@ricky0123/vad-react";
import { Loader2, Mic } from "lucide-react";
// Button, VoiceWaveform, and encodeWav come from the codebase (import paths omitted)

type RecordState = "idle" | "recording" | "processing";

function VoiceMicButton({ onTranscript }: { onTranscript: (text: string) => void }) {
  const [state, setState] = useState<RecordState>("idle");
  const vad = useMicVAD({
    startOnLoad: false,
    baseAssetPath: "/",
    onnxWASMBasePath: "/",
    onSpeechEnd: async (audio: Float32Array) => {
      vad.pause();
      setState("processing");
      try {
        const form = new FormData();
        form.append("audio", encodeWav(audio), "recording.wav");
        const res = await fetch("/api/transcribe", {
          method: "POST", credentials: "include", body: form,
        });
        if (!res.ok) throw new Error(`Transcription failed: ${res.status}`);
        const { text } = (await res.json()) as { text: string };
        if (text?.trim()) onTranscript(text.trim());
      } catch (err) {
        console.error(err);
      } finally {
        setState("idle"); // never leave the button stuck in "processing"
      }
    },
  });

  const handleClick = () => {
    if (state === "idle") { vad.start(); setState("recording"); }
    else if (state === "recording") { vad.pause(); setState("idle"); }
  };

  if (state === "processing") return <Button disabled><Loader2 className="h-4 w-4 animate-spin" /></Button>;
  if (state === "recording") return (
    <Button className="ring-2 ring-primary" onClick={handleClick} aria-label="Recording — speak now">
      <VoiceWaveform listening={vad.listening} />
    </Button>
  );
  return <Button onClick={handleClick} aria-label="Start voice input"><Mic className="h-4 w-4" /></Button>;
}
```
### ChatVoiceBadge (voice_full expand/collapse)
```typescript
// Source: 37-UI-SPEC.md; uses shadcn Collapsible (already installed)
import { useState } from "react";
import { Collapsible, CollapsibleContent, CollapsibleTrigger } from "@/components/ui/collapsible";
import { Badge } from "@/components/ui/badge";
// MarkdownBody is the existing markdown renderer in the codebase (import path omitted)

function ChatVoiceBadge({ content, messageType }: { content: string; messageType: string }) {
  const [open, setOpen] = useState(false);
  const spokenMatch = content.match(/SPOKEN:\s*([\s\S]*?)(?=\nDETAILED:|$)/);
  const spokenText = spokenMatch?.[1]?.trim() ?? content;
  const detailedMatch = content.match(/DETAILED:\s*([\s\S]*)/);
  return (
    <div>
      <Badge variant="outline" className="text-xs mb-2">Voice</Badge>
      <p className="text-sm">{spokenText}</p>
      {messageType === "voice_full" && detailedMatch && (
        <Collapsible open={open} onOpenChange={setOpen}>
          <CollapsibleTrigger className="text-xs text-muted-foreground hover:text-foreground mt-1">
            {open ? "Hide full response" : "Show full response"}
          </CollapsibleTrigger>
          <CollapsibleContent>
            <MarkdownBody className="text-sm mt-2">{detailedMatch[1].trim()}</MarkdownBody>
          </CollapsibleContent>
        </Collapsible>
      )}
    </div>
  );
}
```
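The SPOKEN/DETAILED split in the component is easier to unit-test when pulled out of the JSX. The sketch below reuses the same two regular expressions; `splitVoiceContent` is a hypothetical helper name, and the fallback behaviour (whole content as spoken text, `null` for a missing detailed section) mirrors what the component does inline.

```typescript
interface VoiceContent {
  spoken: string;
  detailed: string | null;
}

// Hypothetical helper: extracts the SPOKEN and DETAILED sections from an
// agent message. Falls back to the full content when no SPOKEN marker exists.
function splitVoiceContent(content: string): VoiceContent {
  const spokenMatch = content.match(/SPOKEN:\s*([\s\S]*?)(?=\nDETAILED:|$)/);
  const detailedMatch = content.match(/DETAILED:\s*([\s\S]*)/);
  return {
    spoken: spokenMatch?.[1]?.trim() ?? content,
    detailed: detailedMatch?.[1]?.trim() ?? null,
  };
}
```

The component could then render `spoken` in the paragraph and pass `detailed` to the collapsible section only when it is non-null.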
---
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| WebRTC VAD polyfill | Silero VAD via ONNX + AudioWorklet | 2023-2024 | Dramatically better accuracy; handles noisy environments |
| MediaRecorder → manual silence timer | @ricky0123/vad-react onSpeechEnd | 2023 | Eliminates timer tuning; model-based accuracy |
| Flash/plugin audio playback | Native `<audio>` + Web Audio API | 2015+ | Universal; no plugin required |
| Custom waveform libraries | Web Audio API AnalyserNode | Always | Zero dependency; 30fps canvas |
**Deprecated/outdated:**
- `annyang`, `artyom.js`: Web Speech API wrappers — browser-only, privacy concerns, no offline support
- Manual silence detection with `onaudioprocess`: Deprecated ScriptProcessor API; replaced by AudioWorklet
- MediaRecorder direct upload (VoiceRecordButton v1): Manual stop only; no auto-silence — replaced by useVadRecorder
---
## Open Questions
1. **autoPlay persistence: nexus-settings vs localStorage**
- What we know: nexus-settings already has voiceMode field. autoPlay (WCHAT-06) is a separate user preference.
- What's unclear: Should autoPlay live in nexus-settings (persisted server-side, works across devices) or localStorage (client-only, simpler)?
- Recommendation: Use localStorage key `nexus:voice:autoplay` for autoPlay — it is a per-device UX preference that doesn't need server-side persistence. Keeps nexus-settings lean.
2. **COEP impact on existing cross-origin resources**
- What we know: COEP `require-corp` blocks cross-origin resources without CORP header.
- What's unclear: Do existing Chat UI components load any cross-origin images (avatar CDN, external URLs in messages)?
- Recommendation: Audit `ui/src/components/ChatMessage.tsx` and `Identity.tsx` for external image src. If any exist, use `credentialless` instead of `require-corp` for COEP — this relaxes the restriction while still enabling SharedArrayBuffer in Chromium 96+. **MEDIUM confidence** — Firefox may not support `credentialless` mode.
3. **VAD false-positive rate in quiet environments**
- What we know: Silero VAD default thresholds are tuned for speech.
- What's unclear: In near-silent environments, keyboard noise or mouse clicks may trigger onSpeechEnd.
- Recommendation: Use `minSpeechFrames: 5` (300ms minimum) as a safety gate. Show "Too short, try again" toast if transcript is empty.
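For open question 1, the localStorage option could look like the sketch below. It uses the `nexus:voice:autoplay` key recommended above; `readAutoPlay`, `writeAutoPlay`, and the `KeyValueStore` interface are hypothetical names, and defaulting to "off" when nothing is stored is an assumption rather than a decision recorded in this research.

```typescript
// The subset of the Storage API used here, so a fake can stand in during tests.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const AUTOPLAY_KEY = "nexus:voice:autoplay";

// Reads the per-device autoplay preference; anything other than the exact
// strings "true"/"false" is treated as unset and yields the fallback.
function readAutoPlay(store: KeyValueStore, fallback = false): boolean {
  const raw = store.getItem(AUTOPLAY_KEY);
  if (raw === "true") return true;
  if (raw === "false") return false;
  return fallback;
}

// Persists the preference as the string "true" or "false".
function writeAutoPlay(store: KeyValueStore, enabled: boolean): void {
  store.setItem(AUTOPLAY_KEY, String(enabled));
}
```

In the browser, `window.localStorage` satisfies `KeyValueStore`, so ChatVoicePlayer could call `readAutoPlay(window.localStorage)` on mount.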
---
## Environment Availability
| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| Node.js | build + tests | ✓ | v20.20.2 | — |
| pnpm | package install | ✓ | 9.15.4 | — |
| @ricky0123/vad-react | WCHAT-02 | ✗ (not installed) | — | Must install via pnpm |
| @ricky0123/vad-web | peer of vad-react | ✗ (not installed) | — | Installed automatically |
| onnxruntime-web | peer of vad-web | ✗ (not installed) | — | Installed automatically |
| Phase 36 Task 2-3 deliverables | All voice routes | ✗ (not on branch) | — | Wave 0 must cherry-pick or re-implement |
**Missing dependencies with no fallback:**
- @ricky0123/vad-react — must be installed (`pnpm add @ricky0123/vad-react --filter @paperclipai/ui`)
- Phase 36 server-side deliverables — POST /api/transcribe, POST /api/synthesize, nexus-settings voiceMode
**Missing dependencies with fallback:**
- None
---
## Validation Architecture
### Test Framework
| Property | Value |
|----------|-------|
| Framework | vitest ^3.0.5 |
| Config file | `ui/vitest.config.ts` |
| Quick run command | `pnpm --filter @paperclipai/ui test --run` |
| Full suite command | `pnpm test --run` |
### Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| WCHAT-01 | VoiceMicButton renders idle/recording/processing states | unit | `pnpm --filter @paperclipai/ui test --run -- VoiceMicButton` | ❌ Wave 0 |
| WCHAT-02 | useVadRecorder calls onTranscript after onSpeechEnd fires | unit | `pnpm --filter @paperclipai/ui test --run -- useVadRecorder` | ❌ Wave 0 |
| WCHAT-03 | VoiceWaveform renders canvas with correct dimensions | unit | `pnpm --filter @paperclipai/ui test --run -- VoiceWaveform` | ❌ Wave 0 |
| WCHAT-04, WCHAT-06 | ChatVoicePlayer renders play button; auto-plays when autoPlay=true | unit | `pnpm --filter @paperclipai/ui test --run -- ChatVoicePlayer` | ❌ Wave 0 |
| WCHAT-05 | VoiceModeToggle renders three pills; click updates mode | unit | `pnpm --filter @paperclipai/ui test --run -- VoiceModeToggle` | ❌ Wave 0 |
| WCHAT-05 | useVoiceMode persists mode to nexus-settings; loads on mount | unit | `pnpm --filter @paperclipai/ui test --run -- useVoiceMode` | ❌ Wave 0 |
| WCHAT-01,02 | POST /api/transcribe returns { text } for WAV upload | unit (server) | `pnpm --filter @paperclipai/server test --run -- voice-routes` | ❌ (Phase 36 Task 3 — verify present) |
| WCHAT-04 | POST /api/synthesize returns audio/wav for text input | unit (server) | `pnpm --filter @paperclipai/server test --run -- voice-routes` | ❌ (Phase 36 Task 3 — verify present) |
| WCHAT-02 | encodeWav produces valid 44-byte WAV header | unit | `pnpm --filter @paperclipai/ui test --run -- encodeWav` | ❌ Wave 0 |
**Note:** UI tests use `// @vitest-environment jsdom` at the top of test files (see ChatInput.test.tsx pattern). All voice component tests must include this directive.
### Sampling Rate
- **Per task commit:** `pnpm --filter @paperclipai/ui test --run`
- **Per wave merge:** `pnpm test --run`
- **Phase gate:** Full suite green before `/gsd:verify-work`
### Wave 0 Gaps
- [ ] `ui/src/components/VoiceMicButton.test.tsx` — covers WCHAT-01
- [ ] `ui/src/hooks/useVadRecorder.test.ts` — covers WCHAT-02
- [ ] `ui/src/components/VoiceWaveform.test.tsx` — covers WCHAT-03
- [ ] `ui/src/components/ChatVoicePlayer.test.tsx` — covers WCHAT-04
- [ ] `ui/src/components/VoiceModeToggle.test.tsx` — covers WCHAT-05
- [ ] `ui/src/hooks/useVoiceMode.test.ts` - covers WCHAT-05 (mode persistence)
- [ ] `ui/src/lib/encodeWav.test.ts` — covers WAV encoding utility
- [ ] Verify `server/src/routes/voice.ts` present (Phase 36 Task 3)
- [ ] Verify `server/src/services/nexus-settings.ts` has voiceMode (Phase 36 Task 2)
---
## Sources
### Primary (HIGH confidence)
- npm registry — `@ricky0123/vad-react@0.0.36`, `@ricky0123/vad-web@0.0.30`, `onnxruntime-web@1.24.3` versions verified 2026-04-03
- https://docs.vad.ricky0123.com/user-guide/api/ — useMicVAD API properties (listening, loading, errored, userSpeaking, start, pause, onSpeechEnd, baseAssetPath, onnxWASMBasePath)
- MDN Web Audio API AnalyserNode documentation — waveform pattern
- 37-UI-SPEC.md (committed in `a0103337`) — component inventory, interaction states, copywriting contract
- 37-CONTEXT.md (committed in `30708d38`) — implementation decisions
### Secondary (MEDIUM confidence)
- Git history analysis (`fd372eaf`, `d0d7a23a`, `11508547`) — Phase 36 deliverable status
- https://web.dev/articles/coop-coep — COOP/COEP header semantics
- Vite docs — `server.headers` for dev server COOP/COEP
### Tertiary (LOW confidence)
- COEP `credentialless` alternative (open question #2) — browser support needs verification
---
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH — npm registry confirmed versions; vad-react API verified from official docs
- Architecture: HIGH — derived from 37-UI-SPEC.md (committed) + existing codebase patterns
- Pitfalls: HIGH - based on verified browser behaviour (autoplay policy, COEP); LOW for pitfall #5 (threshold tuning is empirical)
- Branch status: HIGH — verified via `git log --all --oneline` + `git show` of specific commits
**Research date:** 2026-04-03
**Valid until:** 2026-05-03 (stable APIs; vad-react hasn't released a major version since 2023)