docs(37): phase research — VAD, COOP/COEP, component architecture

2026-04-04 02:07:19 +00:00 · 2026-04-04 02:07:19 +00:00 · fdc956c6a6
commit fdc956c6a6
parent f2f381a3a2
1 changed files with 567 additions and 0 deletions
--- a/.planning/phases/37-web-chat-voice-ui/37-RESEARCH.md
+++ b/.planning/phases/37-web-chat-voice-ui/37-RESEARCH.md
@ -0,0 +1,567 @@
+# Phase 37: Web Chat Voice UI - Research
+
+**Researched:** 2026-04-03
+**Domain:** Browser voice I/O — VAD, MediaRecorder, Web Audio API, waveform visualization, audio playback, COOP/COEP headers
+**Confidence:** HIGH
+
+---
+
+<user_constraints>
+## User Constraints (from CONTEXT.md)
+
+### Locked Decisions
+All implementation choices are at Claude's discretion — discuss phase was skipped per user setting.
+
+### Claude's Discretion
+All implementation details. Use ROADMAP phase goal, success criteria, and codebase conventions.
+
+Key research findings baked into context:
+- `@ricky0123/vad-react ^0.0.36` for browser-side silence detection (VAD)
+- COOP/COEP headers required on Express server for SharedArrayBuffer
+- Waveform via Web Audio API AnalyserNode (Canvas or SVG, 30-50 data points)
+- Native `<audio>` element + URL.createObjectURL() for playback
+- Three-state voice mode: "text" | "voice_input" | "full_voice"
+- VoiceMicButton replaces/enhances existing VoiceRecordButton
+- Voice badge + expandable markdown section in ChatMessage
+
+### Deferred Ideas (OUT OF SCOPE)
+None — discuss phase skipped.
+</user_constraints>
+
+<phase_requirements>
+## Phase Requirements
+
+| ID | Description | Research Support |
+|----|-------------|------------------|
+| WCHAT-01 | Mic button in chat input starts/stops voice recording with visual state (idle/recording/processing) | VoiceMicButton replaces VoiceRecordButton; three-state via listening/userSpeaking/loading from useMicVAD |
+| WCHAT-02 | Recording auto-stops on silence detection via VAD | useMicVAD onSpeechEnd callback fires automatically after 1.5s silence; no manual stop needed |
+| WCHAT-03 | Real-time waveform/amplitude visualization displays while recording | VoiceWaveform canvas component using Web Audio API AnalyserNode + requestAnimationFrame |
+| WCHAT-04 | Voice response audio plays inline in chat message with audio player controls | ChatVoicePlayer with native `<audio>` + URL.createObjectURL(); POST /api/synthesize → blob |
+| WCHAT-05 | User can toggle voice mode: text only / voice input only / full voice (input + output) | VoiceModeToggle three-pill component; persists to nexus-settings voiceMode field |
+| WCHAT-06 | Auto-play of voice responses is configurable (on/off in settings) | autoPlay flag in nexus-settings or localStorage; ChatVoicePlayer reads it on mount |
+</phase_requirements>
+
+---
+
+## Summary
+
+Phase 37 adds browser-based voice I/O to the existing web chat. Phase 36 delivered the server-side pipeline (VoicePipelineService, POST /api/transcribe, POST /api/synthesize, voiceMode wiring in chat.ts) and the nexus-settings schema extension. Phase 37 is a frontend phase with a single server-side addition: COOP/COEP response headers on the Express static middleware.
+
+The central library is `@ricky0123/vad-react ^0.0.36`, which wraps Silero VAD running in an AudioWorklet. It requires the page to be cross-origin isolated (COOP + COEP headers) to use SharedArrayBuffer. The package ships ONNX model files and a worklet bundle that must either be served locally from `public/` or loaded from its default CDN URLs. The CDN default is simpler, but cross-origin assets can be blocked once COEP `require-corp` is active, so this phase serves them locally from `ui/public/`.
+
+Waveform visualization uses a standard Web Audio API AnalyserNode pattern: connect the microphone stream → AnalyserNode → read Uint8Array in requestAnimationFrame loop → render bars on a `<canvas>`. This is entirely in-browser with no extra library. Audio playback for synthesized responses uses the native `<audio>` HTML element with `URL.createObjectURL()` from a Blob received from POST /api/synthesize.
+
+**Primary recommendation:** Install @ricky0123/vad-react, add COOP/COEP headers to Express static/vite-dev middleware, serve VAD assets from `ui/public/`, build five new components + two hooks as specified in 37-UI-SPEC.md, extend ChatInput + ChatMessage, wire voiceMode through useStreamingChat.
+
+---
+
+## Branch Context (Critical)
+
+The current worktree branch (`gsd/phase-35-npx-buildthis-cli`) has only Phase 36 Task 1 committed (VoicePipelineService). The remaining Phase 36 deliverables live on a separate branch not yet merged:
+
+| Phase 36 Deliverable | Git Commit | Status in Current Branch |
+|---|---|---|
+| VoicePipelineService | `0ed912c2` | **PRESENT** |
+| nexus-settings voiceMode schema | `d0d7a23a` | **ABSENT** — must be built in 37 Wave 0 or assumed present |
+| voiceMode in createMessageSchema | `b964c0e4` | **ABSENT** |
+| POST /api/transcribe, POST /api/synthesize routes | `11508547` | **ABSENT** |
+| voiceMode wiring in chat.ts stream route | `fd372eaf` | **ABSENT** |
+
+**Implication for planning:** Wave 0 of Phase 37 must either (a) merge/cherry-pick the Phase 36 remainder, or (b) re-implement the four absent deliverables before building Phase 37 UI. The plan should treat Phase 36 tasks 2-3 as Wave 0 prerequisites and verify them before proceeding.
+
+The ChatInput.tsx, ChatMessage.tsx, VoiceRecordButton.tsx, and related UI components exist on the parent branch `PAP-878-create-a-mine-tab-in-inbox` but NOT in the current worktree. The plan must account for these being the integration targets.
+
+---
+
+## Standard Stack
+
+### Core
+| Library | Version | Purpose | Why Standard |
+|---------|---------|---------|--------------|
+| `@ricky0123/vad-react` | `^0.0.36` | Browser VAD with 1.5s silence auto-stop | Specified in 37-UI-SPEC.md; only mature browser-side VAD library |
+| `@ricky0123/vad-web` | `0.0.30` (peer) | VAD engine (AudioWorklet + Silero ONNX) | Peer dep of vad-react |
+| `onnxruntime-web` | `^1.17.0` (peer) | ONNX runtime for Silero model | Required by vad-web |
+| Web Audio API | browser built-in | AnalyserNode for waveform bars | Zero bundle cost; already in browser |
+| Native `<audio>` | browser built-in | Playback of synthesized WAV | No extra library needed |
+
+### Supporting
+| Library | Version | Purpose | When to Use |
+|---------|---------|---------|-------------|
+| lucide-react | `^0.574.0` (already in ui/) | `Mic`, `Square`, `Loader2`, `Volume2`, `Play`, `Pause` icons | Voice button states + audio player |
+| shadcn/ui `Badge` | already installed | Voice badge on agent messages | ChatVoiceBadge component |
+| shadcn/ui `Collapsible` | already installed | Expand/collapse full markdown in voice_full messages | ChatVoiceBadge expand section |
+
+### Alternatives Considered
+| Instead of | Could Use | Tradeoff |
+|------------|-----------|----------|
+| @ricky0123/vad-react | Manual silence detection with AudioWorklet | Much more complex; vad-react is the de facto standard |
+| Canvas waveform | SVG bars | Canvas performs better for 30fps animation |
+| Native `<audio>` + blob URL | Howler.js | No extra dependency; native handles WAV fine |
+
+**Installation:**
+```bash
+pnpm add @ricky0123/vad-react --filter @paperclipai/ui
+```
+
+**Version verification (confirmed against npm registry 2026-04-03):**
+- `@ricky0123/vad-react`: 0.0.36 (latest)
+- `@ricky0123/vad-web`: 0.0.30 (peer dependency, installed automatically)
+- `onnxruntime-web`: 1.24.3 (latest; ^1.17.0 from vad-web is satisfied)
+
+---
+
+## Architecture Patterns
+
+### Recommended Project Structure
+```
+ui/src/
+├── components/
+│   ├── VoiceMicButton.tsx       # Replaces VoiceRecordButton — VAD + waveform + three states
+│   ├── VoiceWaveform.tsx        # Canvas amplitude bars (30-50 points, 32px tall)
+│   ├── VoiceModeToggle.tsx      # Three-pill: Text / Voice In / Full Voice
+│   ├── ChatVoicePlayer.tsx      # Inline audio player with play/pause/progress
+│   └── ChatVoiceBadge.tsx       # "Voice" badge + collapsible full markdown
+├── hooks/
+│   ├── useVadRecorder.ts        # Wraps useMicVAD; exposes Float32Array on speech end
+│   └── useVoiceMode.ts          # Reads/writes voiceMode from nexus-settings
+ui/public/
+│   ├── vad.worklet.bundle.min.js   # From @ricky0123/vad-web/dist/
+│   ├── silero_vad_legacy.onnx      # From @ricky0123/vad-web/dist/
+│   └── silero_vad_v5.onnx          # From @ricky0123/vad-web/dist/
+server/src/
+└── app.ts  (add COOP/COEP headers middleware)
+```
+
+### Pattern 1: useMicVAD from @ricky0123/vad-react
+**What:** Hook that runs Silero VAD in an AudioWorklet; fires `onSpeechEnd(audio: Float32Array)` after silence
+**When to use:** VoiceMicButton and useVadRecorder hook
+
+```typescript
+// Source: https://docs.vad.ricky0123.com/user-guide/api/
+import { useMicVAD } from "@ricky0123/vad-react";
+
+const vad = useMicVAD({
+  startOnLoad: false,            // user must click mic button first
+  onSpeechEnd: (audio: Float32Array) => {
+    // audio is Float32Array at 16kHz
+    // Convert to WAV blob and POST to /api/transcribe
+  },
+  onSpeechStart: () => { /* update waveform active state */ },
+  positiveSpeechThreshold: 0.8,
+  negativeSpeechThreshold: 0.8 - 0.15,
+  redemptionFrames: 8,           // ~480ms silence before speech_end (at ~60ms/frame); the 1.5s target needs ~25
+  baseAssetPath: "/",            // serve from ui/public/
+  onnxWASMBasePath: "/",
+});
+
+// Returned: { listening, loading, errored, userSpeaking, start, pause }
+```
+
+**Audio conversion for upload:**
+```typescript
+// Float32Array → WAV blob (16kHz, mono, 16-bit PCM)
+function float32ToWav(samples: Float32Array, sampleRate = 16000): Blob {
+  const buffer = new ArrayBuffer(44 + samples.length * 2);
+  const view = new DataView(buffer);
+  // WAV header...
+  return new Blob([buffer], { type: "audio/wav" });
+}
+```
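+
+The three visual states required by WCHAT-01 can be derived from the hook's flags rather than tracked separately. The following is a minimal sketch, assuming a hypothetical `deriveMicState` helper and a caller-owned `uploading` flag; neither is part of vad-react.
+
+```typescript
+type MicState = "idle" | "recording" | "processing";
+
+// Subset of the flags returned by useMicVAD that the button needs.
+interface VadFlags {
+  loading: boolean;   // model/worklet still initialising
+  listening: boolean; // VAD is running on the mic stream
+}
+
+// Hypothetical helper: maps VAD flags plus an upload-in-flight flag to a button state.
+function deriveMicState(vad: VadFlags, uploading: boolean): MicState {
+  if (uploading || vad.loading) return "processing";
+  return vad.listening ? "recording" : "idle";
+}
+```
+
+Deriving the state this way keeps the spinner, waveform, and mic icon from drifting out of sync with the hook.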
+
+### Pattern 2: Web Audio API AnalyserNode for waveform
+**What:** Connect MediaStream to AnalyserNode; poll getByteFrequencyData in rAF loop
+**When to use:** VoiceWaveform component (only while recording)
+
+```typescript
+// Source: MDN Web Audio API docs
+const audioCtx = new AudioContext();
+const analyser = audioCtx.createAnalyser();
+analyser.fftSize = 64;          // 32 frequency bins
+const source = audioCtx.createMediaStreamSource(stream);
+source.connect(analyser);
+const dataArray = new Uint8Array(analyser.frequencyBinCount); // 32 bars
+
+function draw() {
+  animRef.current = requestAnimationFrame(draw);
+  analyser.getByteFrequencyData(dataArray);
+  // render bars to canvas
+}
+```
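+
+The "render bars to canvas" step needs the byte data reduced to a fixed number of bar heights. A sketch of that reduction, assuming a hypothetical `computeBarHeights` helper (the name, averaging strategy, and rounding are illustrative choices, not taken from 37-UI-SPEC.md):
+
+```typescript
+// Hypothetical helper: reduces AnalyserNode byte data (0-255) to bar heights in pixels.
+function computeBarHeights(
+  data: Uint8Array,
+  barCount: number,
+  maxHeight: number,
+): number[] {
+  if (barCount <= 0 || data.length === 0) return [];
+  const binsPerBar = data.length / barCount;
+  const heights: number[] = [];
+  for (let bar = 0; bar < barCount; bar++) {
+    const start = Math.floor(bar * binsPerBar);
+    // Always include at least one bin, even when barCount exceeds data.length
+    const end = Math.max(start + 1, Math.floor((bar + 1) * binsPerBar));
+    let sum = 0;
+    for (let i = start; i < end; i++) sum += data[i];
+    const mean = sum / (end - start);
+    heights.push(Math.round((mean / 255) * maxHeight));
+  }
+  return heights;
+}
+```
+
+Inside `draw()`, the result could be used as `computeBarHeights(dataArray, 32, 32)` and each value painted with `fillRect`.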
+
+### Pattern 3: COOP/COEP headers in Express
+**What:** Cross-origin isolation required for SharedArrayBuffer (used by AudioWorklet/ONNX)
+**When to use:** All static responses and Vite dev server
+
+```typescript
+// Source: MDN - Cross-Origin Isolation
+// In server/src/app.ts, before static/vite middleware:
+app.use((_req, res, next) => {
+  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
+  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
+  next();
+});
+
+// For Vite dev in vite.config.ts:
+server: {
+  headers: {
+    "Cross-Origin-Opener-Policy": "same-origin",
+    "Cross-Origin-Embedder-Policy": "require-corp",
+  },
+},
+```
+
+**Critical:** COEP `require-corp` means all cross-origin resources must opt in with CORP headers. This applies to CDN-hosted VAD assets as well as externally hosted images. Serve VAD assets from `ui/public/` (same-origin) to avoid CORP issues for them entirely.
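+
+To keep the Express middleware and the Vite config from drifting apart, the header pair can come from one place, and the client can confirm isolation before starting VAD. A sketch under stated assumptions: `isolationHeaders` and `isCrossOriginIsolated` are hypothetical helpers; `crossOriginIsolated` is the standard browser global.
+
+```typescript
+type CoepMode = "require-corp" | "credentialless";
+
+// Hypothetical helper: returns the header pair that enables cross-origin isolation.
+function isolationHeaders(coep: CoepMode = "require-corp"): Record<string, string> {
+  return {
+    "Cross-Origin-Opener-Policy": "same-origin",
+    "Cross-Origin-Embedder-Policy": coep,
+  };
+}
+
+// Hypothetical helper: pass `window` in the browser; true only when the page is isolated.
+function isCrossOriginIsolated(scope: { crossOriginIsolated?: boolean }): boolean {
+  return scope.crossOriginIsolated === true;
+}
+```
+
+If `isCrossOriginIsolated(window)` is false after the headers are added, a response on the page's load path is probably missing one of the two headers.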
+
+### Pattern 4: VAD asset setup (Vite)
+**What:** Copy ONNX + worklet files to public/ so they are served at root
+**When to use:** Build setup task
+
+```bash
+# After pnpm install, copy from node_modules:
+cp node_modules/@ricky0123/vad-web/dist/vad.worklet.bundle.min.js ui/public/
+cp node_modules/@ricky0123/vad-web/dist/silero_vad_legacy.onnx ui/public/
+cp node_modules/@ricky0123/vad-web/dist/silero_vad_v5.onnx ui/public/
+```
+
+Alternatively, use `vite-plugin-static-copy`, or add a script to the UI package's `package.json` and run it from `prepare`:
+```json
+"scripts": {
+  "copy-vad-assets": "cp node_modules/@ricky0123/vad-web/dist/vad.worklet.bundle.min.js public/ && cp node_modules/@ricky0123/vad-web/dist/*.onnx public/"
+}
+```
+
+### Pattern 5: useVoiceMode hook
+**What:** Reads voiceMode from GET /api/nexus-settings, writes via PATCH
+**When to use:** VoiceModeToggle component; ChatPanel to pass voiceMode to stream call
+
+```typescript
+// Source: existing nexus-settings pattern in codebase
+type VoiceMode = "text" | "voice_input" | "full_voice";
+
+export function useVoiceMode() {
+  const [mode, setMode] = useState<VoiceMode>("text");
+  // Load on mount via GET /api/nexus-settings
+  // PATCH on change
+  return { mode, setMode: async (next: VoiceMode) => { ... } };
+}
+```
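+
+Whatever the settings endpoint returns should be narrowed before it reaches component state. A minimal sketch, assuming hypothetical helpers `parseVoiceMode`, `usesVoiceInput`, and `usesVoiceOutput`; only the three mode strings come from the spec.
+
+```typescript
+const VOICE_MODES = ["text", "voice_input", "full_voice"] as const;
+type VoiceMode = (typeof VOICE_MODES)[number];
+
+// Narrows an untrusted value (e.g. a settings response field) to a VoiceMode; defaults to "text".
+function parseVoiceMode(value: unknown): VoiceMode {
+  const match = VOICE_MODES.find((mode) => mode === value);
+  return match ?? "text";
+}
+
+// Voice input is active in both voice modes.
+function usesVoiceInput(mode: VoiceMode): boolean {
+  return mode !== "text";
+}
+
+// Spoken output is only active in full voice mode.
+function usesVoiceOutput(mode: VoiceMode): boolean {
+  return mode === "full_voice";
+}
+```
+
+Falling back to `"text"` matches the hook's initial state, so a missing or malformed field leaves the chat in its existing text-only behaviour.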
+
+### Pattern 6: Float32Array → WAV Blob
+**What:** Convert vad-react onSpeechEnd Float32Array (16kHz) to WAV for upload
+**When to use:** useVadRecorder.ts, before POSTing to /api/transcribe
+
+```typescript
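+// Source: standard WAV encoding algorithm (verified against multiple sources)
+// Writes an ASCII tag such as "RIFF" into the header, one byte per character.
+function writeString(view: DataView, offset: number, text: string): void {
+  for (let i = 0; i < text.length; i++) {
+    view.setUint8(offset + i, text.charCodeAt(i));
+  }
+}
+
+function encodeWav(samples: Float32Array, sampleRate = 16000): Blob {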
+  const numSamples = samples.length;
+  const buffer = new ArrayBuffer(44 + numSamples * 2);
+  const view = new DataView(buffer);
+  // RIFF chunk
+  writeString(view, 0, "RIFF");
+  view.setUint32(4, 36 + numSamples * 2, true);
+  writeString(view, 8, "WAVE");
+  // fmt sub-chunk
+  writeString(view, 12, "fmt ");
+  view.setUint32(16, 16, true);  // fmt chunk size (16 bytes for PCM)
+  view.setUint16(20, 1, true);   // audio format: PCM = 1
+  view.setUint16(22, 1, true);   // mono
+  view.setUint32(24, sampleRate, true);
+  view.setUint32(28, sampleRate * 2, true);  // byte rate
+  view.setUint16(32, 2, true);   // block align
+  view.setUint16(34, 16, true);  // bits per sample
+  // data sub-chunk
+  writeString(view, 36, "data");
+  view.setUint32(40, numSamples * 2, true);
+  let offset = 44;
+  for (let i = 0; i < numSamples; i++) {
+    const s = Math.max(-1, Math.min(1, samples[i]));
+    view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
+    offset += 2;
+  }
+  return new Blob([buffer], { type: "audio/wav" });
+}
+```
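+
+For the planned `encodeWav` unit test, it helps to read the header back and check it against the layout written above. A sketch assuming a hypothetical `readWavHeader` helper that only understands the canonical 44-byte header produced by Pattern 6:
+
+```typescript
+interface WavInfo {
+  sampleRate: number;
+  channels: number;
+  bitsPerSample: number;
+  dataBytes: number;
+  durationSec: number;
+}
+
+// Hypothetical helper: parses the fixed 44-byte PCM header (little-endian fields).
+function readWavHeader(buffer: ArrayBuffer): WavInfo {
+  if (buffer.byteLength < 44) throw new Error("Buffer shorter than the 44-byte WAV header");
+  const view = new DataView(buffer);
+  const tag = (offset: number): string =>
+    String.fromCharCode(
+      view.getUint8(offset),
+      view.getUint8(offset + 1),
+      view.getUint8(offset + 2),
+      view.getUint8(offset + 3),
+    );
+  if (tag(0) !== "RIFF" || tag(8) !== "WAVE" || tag(36) !== "data") {
+    throw new Error("Not a canonical 44-byte-header WAV");
+  }
+  const channels = view.getUint16(22, true);
+  const sampleRate = view.getUint32(24, true);
+  const bitsPerSample = view.getUint16(34, true);
+  const dataBytes = view.getUint32(40, true);
+  const bytesPerSecond = sampleRate * channels * (bitsPerSample / 8);
+  return { sampleRate, channels, bitsPerSample, dataBytes, durationSec: dataBytes / bytesPerSecond };
+}
+```
+
+In a test, the encoder's Blob can be converted with `await blob.arrayBuffer()` and passed in; one second of 16kHz mono audio should report `dataBytes` of 32000.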
+
+### Pattern 7: POST /api/synthesize + playback
+**What:** Send text to synthesis endpoint, receive WAV buffer, play with native audio
+**When to use:** ChatVoicePlayer when messageType is voice_full
+
+```typescript
+async function playVoiceResponse(text: string, autoPlay: boolean) {
+  const res = await fetch("/api/synthesize", {
+    method: "POST",
+    headers: { "Content-Type": "application/json" },
+    credentials: "include",
+    body: JSON.stringify({ text }),
+  });
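+  if (!res.ok) throw new Error(`Synthesis failed: ${res.status}`);
+  const blob = await res.blob();
+  const url = URL.createObjectURL(blob);
+  const audio = new Audio(url);
+  // expose pause/play controls; revoke URL on ended
+  audio.addEventListener("ended", () => URL.revokeObjectURL(url));
+  // play() returns a promise that rejects if the browser blocks autoplay
+  if (autoPlay) await audio.play().catch(() => { /* leave paused for manual play */ });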
+}
+```
+
+### Anti-Patterns to Avoid
+- **Calling useMicVAD with startOnLoad: true:** Triggers immediate mic permission prompt on page load, not on user gesture. Always use `startOnLoad: false` and call `vad.start()` on mic button click.
+- **Using AudioContext before user gesture:** Browsers require AudioContext creation/resume inside a user interaction. Create it lazily in the click handler, not on component mount.
+- **Serving VAD assets from CDN with COEP require-corp:** CDN resources lack CORP headers. Will cause COEP fetch errors. Always copy to `ui/public/` and use `baseAssetPath: "/"`.
+- **Not revoking blob URLs:** `URL.createObjectURL()` leaks memory if URLs are not revoked after use.
+- **POSTing Float32Array directly to /api/transcribe:** The transcribe endpoint expects `audio/webm` or `audio/wav` multipart upload. Must encode Float32Array to WAV first.
+
+---
+
+## Don't Hand-Roll
+
+| Problem | Don't Build | Use Instead | Why |
+|---------|-------------|-------------|-----|
+| Silence detection | Custom silence timer with AudioWorklet | @ricky0123/vad-react | Silero VAD model; handles background noise, breath, plosives; 37 published versions |
+| WAV encoding | Custom encoder | Short standard WAV encoder with a 44-byte header (see Pattern 6) | Not complex enough for a library; standard algorithm |
+| Audio playback | Custom audio element abstraction | Native `<audio>` + URL.createObjectURL() | Browser handles all codec/format negotiation |
+| ONNX inference | Build ONNX runner | onnxruntime-web (peer dep of vad-web) | Already bundled |
+
+---
+
+## Common Pitfalls
+
+### Pitfall 1: COEP blocks CDN asset loading
+**What goes wrong:** After adding `Cross-Origin-Embedder-Policy: require-corp`, all cross-origin resources (Google Fonts, avatars from external URLs, CDN assets) are blocked unless they send `Cross-Origin-Resource-Policy: cross-origin`. Existing chat images from `/api/assets/` (same-origin) are fine, but any externally hosted content breaks.
+**Why it happens:** COEP `require-corp` enforces CORP on all sub-resources.
+**How to avoid:** Serve all VAD ONNX/worklet assets from `ui/public/` (same-origin). Audit for any cross-origin resource loads in existing chat components before adding headers.
+**Warning signs:** Console errors: "COEP blocked cross-origin resource" for non-audio assets.
+
+### Pitfall 2: AudioContext suspended due to autoplay policy
+**What goes wrong:** `AudioContext.state === "suspended"` prevents AnalyserNode from producing data; waveform is all zeros.
+**Why it happens:** Browsers require AudioContext to be created or resumed inside a user gesture (click/tap).
+**How to avoid:** Create `new AudioContext()` lazily inside the mic button click handler. If the context exists but is suspended, call `context.resume()` before starting recording.
+**Warning signs:** Waveform canvas renders but all bars are flat (zero amplitude).
+
+### Pitfall 3: VAD model files not found
+**What goes wrong:** `useMicVAD` throws or hangs with `loading: true` indefinitely; console shows 404 for `.onnx` or `.worklet.bundle.min.js`.
+**Why it happens:** Default `baseAssetPath` may point to CDN; if COEP is active, CDN fetch is blocked. Or files were not copied to `ui/public/`.
+**How to avoid:** Explicitly set `baseAssetPath: "/"` and `onnxWASMBasePath: "/"` in useMicVAD options. Verify files exist at `ui/public/vad.worklet.bundle.min.js`, `ui/public/silero_vad_legacy.onnx`, `ui/public/silero_vad_v5.onnx` after install.
+**Warning signs:** `vad.loading === true` for more than 3 seconds; 404s in network tab.
+
+### Pitfall 4: voiceMode not passed through useStreamingChat
+**What goes wrong:** Sending a voice_input message, the server doesn't set `messageType: "voice_input"` on the stored message, so ChatVoiceBadge never renders.
+**Why it happens:** The current signature of `useStreamingChat.startStream()` is `(userMessage: string, agentId?: string)`, with no `voiceMode` parameter. `chat.ts` only sets messageType when voiceMode is present in the request body.
+**How to avoid:** Extend `useStreamingChat.startStream()` to accept `voiceMode?: string` and pass it in the fetch body to `/api/conversations/${id}/stream`.
+**Warning signs:** Voice messages render as plain user text without the voice badge.
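+
+One way to make the extra parameter hard to forget is to build the request in a single function. A sketch only: `buildStreamRequest` is hypothetical, and the body field names `message` and `agentId` are assumptions that should be checked against the existing `startStream()` implementation; only the URL shape and the `voiceMode` field come from this research.
+
+```typescript
+type VoiceMode = "text" | "voice_input" | "full_voice";
+
+interface StreamRequest {
+  url: string;
+  init: {
+    method: "POST";
+    headers: Record<string, string>;
+    credentials: "include";
+    body: string;
+  };
+}
+
+// Hypothetical helper: assembles the stream request, adding voiceMode only when provided.
+function buildStreamRequest(
+  conversationId: string,
+  userMessage: string,
+  agentId?: string,
+  voiceMode?: VoiceMode,
+): StreamRequest {
+  const body: Record<string, string> = { message: userMessage }; // field name assumed
+  if (agentId) body.agentId = agentId;                           // field name assumed
+  if (voiceMode) body.voiceMode = voiceMode;
+  return {
+    url: `/api/conversations/${encodeURIComponent(conversationId)}/stream`,
+    init: {
+      method: "POST",
+      headers: { "Content-Type": "application/json" },
+      credentials: "include",
+      body: JSON.stringify(body),
+    },
+  };
+}
+```
+
+The result can be passed straight to `fetch(req.url, req.init)`, and a unit test can assert on the parsed body without mocking the network.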
+
+### Pitfall 5: onSpeechEnd fires on very short utterances
+**What goes wrong:** Background noise triggers `onSpeechEnd` with very short audio that produces garbage transcription.
+**Why it happens:** VAD fires even for brief sounds if `positiveSpeechThreshold` is too low.
+**How to avoid:** Set `minSpeechFrames: 3` (minimum ~180ms) and `positiveSpeechThreshold: 0.8` to filter noise. Display a "Too short" toast if the returned text is empty or < 2 chars.
+**Warning signs:** Empty transcriptions appearing in chat; fast repeated submissions.
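+
+A client-side guard can back up the VAD thresholds: skip the upload when the clip is too short, and reject transcripts under two characters. The helper names and the 180ms default below are illustrative assumptions; the 16kHz sample rate and the two-character rule come from this document.
+
+```typescript
+// Duration of a mono clip in milliseconds.
+function utteranceDurationMs(samples: Float32Array, sampleRate = 16000): number {
+  return (samples.length * 1000) / sampleRate;
+}
+
+// Hypothetical gate applied before POSTing to /api/transcribe.
+function isUsableUtterance(samples: Float32Array, minMs = 180, sampleRate = 16000): boolean {
+  return utteranceDurationMs(samples, sampleRate) >= minMs;
+}
+
+// Hypothetical gate applied to the transcription result; fewer than 2 characters counts as "too short".
+function isUsableTranscript(text: string | null | undefined): boolean {
+  return (text ?? "").trim().length >= 2;
+}
+```
+
+When either check fails, the button can return to idle and show the "Too short" toast instead of submitting an empty message.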
+
+### Pitfall 6: Phase 36 deliverables not present in working branch
+**What goes wrong:** Building ChatInput voice integration before `server/src/routes/voice.ts`, `server/src/services/nexus-settings.ts` voiceMode schema, and `voiceMode` in createMessageSchema are present causes compile errors and missing endpoints.
+**Why it happens:** Only Phase 36 Task 1 (VoicePipelineService) is on the current branch.
+**How to avoid:** Wave 0 must cherry-pick or re-implement Phase 36 Tasks 2-3 commits before any Phase 37 implementation work. Verify that `POST /api/transcribe` and `POST /api/synthesize` respond successfully (2xx) before proceeding.
+
+---
+
+## Code Examples
+
+### VoiceMicButton state machine
+```typescript
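+// Source: 37-UI-SPEC.md + useMicVAD API docs
+// Import paths for Button, VoiceWaveform, and encodeWav are assumed from the planned layout.
+import { useState } from "react";
+import { useMicVAD } from "@ricky0123/vad-react";
+import { Loader2, Mic } from "lucide-react";
+import { Button } from "@/components/ui/button";
+import { VoiceWaveform } from "./VoiceWaveform";
+import { encodeWav } from "@/lib/encodeWav";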
+type RecordState = "idle" | "recording" | "processing";
+
+function VoiceMicButton({ onTranscript }: { onTranscript: (text: string) => void }) {
+  const [state, setState] = useState<RecordState>("idle");
+  const vad = useMicVAD({
+    startOnLoad: false,
+    baseAssetPath: "/",
+    onnxWASMBasePath: "/",
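+    onSpeechEnd: async (audio: Float32Array) => {
+      vad.pause();
+      setState("processing");
+      try {
+        const wav = encodeWav(audio);
+        const form = new FormData();
+        form.append("audio", wav, "recording.wav");
+        const res = await fetch("/api/transcribe", {
+          method: "POST", credentials: "include", body: form,
+        });
+        if (!res.ok) throw new Error(`Transcription failed: ${res.status}`);
+        const { text } = await res.json() as { text: string };
+        if (text?.trim()) onTranscript(text.trim());
+      } catch (err) {
+        console.error("Transcription failed", err);
+      } finally {
+        // Always leave "processing", even if the upload or parse fails
+        setState("idle");
+      }
+    },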
+  });
+
+  const handleClick = () => {
+    if (state === "idle") { vad.start(); setState("recording"); }
+    else if (state === "recording") { vad.pause(); setState("idle"); }
+  };
+
+  if (state === "processing") return <Button disabled><Loader2 className="h-4 w-4 animate-spin" /></Button>;
+  if (state === "recording") return (
+    <Button className="ring-2 ring-primary" onClick={handleClick} aria-label="Recording — speak now">
+      <VoiceWaveform listening={vad.listening} />
+    </Button>
+  );
+  return <Button onClick={handleClick} aria-label="Start voice input"><Mic className="h-4 w-4" /></Button>;
+}
+```
+
+### ChatVoiceBadge (voice_full expand/collapse)
+```typescript
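+// Source: 37-UI-SPEC.md; uses shadcn Collapsible (already installed)
+// MarkdownBody is assumed to be the codebase's markdown renderer; import it from its actual location.
+import { useState } from "react";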
+import { Collapsible, CollapsibleContent, CollapsibleTrigger } from "@/components/ui/collapsible";
+import { Badge } from "@/components/ui/badge";
+
+function ChatVoiceBadge({ content, messageType }: { content: string; messageType: string }) {
+  const [open, setOpen] = useState(false);
+  const spokenMatch = content.match(/SPOKEN:\s*([\s\S]*?)(?=\nDETAILED:|$)/);
+  const spokenText = spokenMatch?.[1]?.trim() ?? content;
+  const detailedMatch = content.match(/DETAILED:\s*([\s\S]*)/);
+
+  return (
+    <div>
+      <Badge variant="outline" className="text-xs mb-2">Voice</Badge>
+      <p className="text-sm">{spokenText}</p>
+      {messageType === "voice_full" && detailedMatch && (
+        <Collapsible open={open} onOpenChange={setOpen}>
+          <CollapsibleTrigger className="text-xs text-muted-foreground hover:text-foreground mt-1">
+            {open ? "Hide full response" : "Show full response"}
+          </CollapsibleTrigger>
+          <CollapsibleContent>
+            <MarkdownBody className="text-sm mt-2">{detailedMatch[1].trim()}</MarkdownBody>
+          </CollapsibleContent>
+        </Collapsible>
+      )}
+    </div>
+  );
+}
+```
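+
+The SPOKEN/DETAILED parsing is easier to unit-test when it lives outside the component. A sketch using the same two regular expressions as the component above; the `splitVoiceContent` name and the `VoiceContent` shape are hypothetical.
+
+```typescript
+interface VoiceContent {
+  spoken: string;
+  detailed: string | null;
+}
+
+// Hypothetical helper: splits a voice response into its spoken summary and detailed markdown.
+function splitVoiceContent(content: string): VoiceContent {
+  const spokenMatch = content.match(/SPOKEN:\s*([\s\S]*?)(?=\nDETAILED:|$)/);
+  const detailedMatch = content.match(/DETAILED:\s*([\s\S]*)/);
+  return {
+    // Fall back to the whole message when no SPOKEN: marker is present
+    spoken: spokenMatch?.[1]?.trim() ?? content,
+    detailed: detailedMatch?.[1]?.trim() ?? null,
+  };
+}
+```
+
+ChatVoiceBadge could then render `spoken` directly and show the collapsible only when `detailed` is non-null.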
+
+---
+
+## State of the Art
+
+| Old Approach | Current Approach | When Changed | Impact |
+|--------------|------------------|--------------|--------|
+| WebRTC VAD polyfill | Silero VAD via ONNX + AudioWorklet | 2023-2024 | Dramatically better accuracy; handles noisy environments |
+| MediaRecorder → manual silence timer | @ricky0123/vad-react onSpeechEnd | 2023 | Eliminates timer tuning; model-based accuracy |
+| Flash/plugin audio playback | Native `<audio>` + Web Audio API | 2015+ | Universal; no plugin required |
+| Custom waveform libraries | Web Audio API AnalyserNode | Always | Zero dependency; 30fps canvas |
+
+**Deprecated/outdated:**
+- `annyang`, `artyom.js`: Web Speech API wrappers — browser-only, privacy concerns, no offline support
+- Manual silence detection with `onaudioprocess`: Deprecated ScriptProcessor API; replaced by AudioWorklet
+- MediaRecorder direct upload (VoiceRecordButton v1): Manual stop only; no auto-silence — replaced by useVadRecorder
+
+---
+
+## Open Questions
+
+1. **autoPlay persistence: nexus-settings vs localStorage**
+   - What we know: nexus-settings already has voiceMode field. autoPlay (WCHAT-06) is a separate user preference.
+   - What's unclear: Should autoPlay live in nexus-settings (persisted server-side, works across devices) or localStorage (client-only, simpler)?
+   - Recommendation: Use localStorage key `nexus:voice:autoplay` for autoPlay — it is a per-device UX preference that doesn't need server-side persistence. Keeps nexus-settings lean.
+
+2. **COEP impact on existing cross-origin resources**
+   - What we know: COEP `require-corp` blocks cross-origin resources without CORP header.
+   - What's unclear: Do existing Chat UI components load any cross-origin images (avatar CDN, external URLs in messages)?
+   - Recommendation: Audit `ui/src/components/ChatMessage.tsx` and `Identity.tsx` for external image src. If any exist, use `credentialless` instead of `require-corp` for COEP — this relaxes the restriction while still enabling SharedArrayBuffer in Chromium 96+. **MEDIUM confidence** — Firefox may not support `credentialless` mode.
+
+3. **VAD false-positive rate in quiet environments**
+   - What we know: Silero VAD default thresholds are tuned for speech.
+   - What's unclear: In near-silent environments, keyboard noise or mouse clicks may trigger onSpeechEnd.
+   - Recommendation: Use `minSpeechFrames: 5` (300ms minimum) as a safety gate. Show a "Too short, try again" toast if the transcript is empty.
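+
+The localStorage recommendation in question 1 could look like the sketch below. The key `nexus:voice:autoplay` is the one proposed above; the helper names, the `KeyValueStore` interface, and the `false` default are assumptions. Accepting the store as a parameter lets tests pass a stub instead of depending on jsdom's `localStorage`.
+
+```typescript
+const AUTOPLAY_KEY = "nexus:voice:autoplay";
+
+// Minimal subset of the Storage interface; window.localStorage satisfies it.
+interface KeyValueStore {
+  getItem(key: string): string | null;
+  setItem(key: string, value: string): void;
+}
+
+// Hypothetical helper: returns the stored preference, or the fallback when nothing is stored.
+function readAutoPlay(store: KeyValueStore, fallback = false): boolean {
+  const raw = store.getItem(AUTOPLAY_KEY);
+  if (raw === null) return fallback;
+  return raw === "true";
+}
+
+// Hypothetical helper: persists the preference as the string "true" or "false".
+function writeAutoPlay(store: KeyValueStore, enabled: boolean): void {
+  store.setItem(AUTOPLAY_KEY, String(enabled));
+}
+```
+
+ChatVoicePlayer would call `readAutoPlay(window.localStorage)` on mount, which matches the behaviour described for WCHAT-06.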
+
+---
+
+## Environment Availability
+
+| Dependency | Required By | Available | Version | Fallback |
+|------------|------------|-----------|---------|----------|
+| Node.js | build + tests | ✓ | v20.20.2 | — |
+| pnpm | package install | ✓ | 9.15.4 | — |
+| @ricky0123/vad-react | WCHAT-02 | ✗ (not installed) | — | Must install via pnpm |
+| @ricky0123/vad-web | peer of vad-react | ✗ (not installed) | — | Installed automatically |
+| onnxruntime-web | peer of vad-web | ✗ (not installed) | — | Installed automatically |
+| Phase 36 Task 2-3 deliverables | All voice routes | ✗ (not on branch) | — | Wave 0 must cherry-pick or re-implement |
+
+**Missing dependencies with no fallback:**
+- @ricky0123/vad-react — must be installed (`pnpm add @ricky0123/vad-react --filter @paperclipai/ui`)
+- Phase 36 server-side deliverables — POST /api/transcribe, POST /api/synthesize, nexus-settings voiceMode
+
+**Missing dependencies with fallback:**
+- None
+
+---
+
+## Validation Architecture
+
+### Test Framework
+| Property | Value |
+|----------|-------|
+| Framework | vitest ^3.0.5 |
+| Config file | `ui/vitest.config.ts` |
+| Quick run command | `pnpm --filter @paperclipai/ui test --run` |
+| Full suite command | `pnpm test --run` |
+
+### Phase Requirements → Test Map
+| Req ID | Behavior | Test Type | Automated Command | File Exists? |
+|--------|----------|-----------|-------------------|-------------|
+| WCHAT-01 | VoiceMicButton renders idle/recording/processing states | unit | `pnpm --filter @paperclipai/ui test --run -- VoiceMicButton` | ❌ Wave 0 |
+| WCHAT-02 | useVadRecorder calls onTranscript after onSpeechEnd fires | unit | `pnpm --filter @paperclipai/ui test --run -- useVadRecorder` | ❌ Wave 0 |
+| WCHAT-03 | VoiceWaveform renders canvas with correct dimensions | unit | `pnpm --filter @paperclipai/ui test --run -- VoiceWaveform` | ❌ Wave 0 |
+| WCHAT-04 | ChatVoicePlayer renders play button; auto-plays when autoPlay=true | unit | `pnpm --filter @paperclipai/ui test --run -- ChatVoicePlayer` | ❌ Wave 0 |
+| WCHAT-05 | VoiceModeToggle renders three pills; click updates mode | unit | `pnpm --filter @paperclipai/ui test --run -- VoiceModeToggle` | ❌ Wave 0 |
+| WCHAT-06 | useVoiceMode persists mode to nexus-settings; loads on mount | unit | `pnpm --filter @paperclipai/ui test --run -- useVoiceMode` | ❌ Wave 0 |
+| WCHAT-01,02 | POST /api/transcribe returns { text } for WAV upload | unit (server) | `pnpm --filter @paperclipai/server test --run -- voice-routes` | ❌ (Phase 36 Task 3 — verify present) |
+| WCHAT-04 | POST /api/synthesize returns audio/wav for text input | unit (server) | `pnpm --filter @paperclipai/server test --run -- voice-routes` | ❌ (Phase 36 Task 3 — verify present) |
+| WCHAT-02 | encodeWav produces valid 44-byte WAV header | unit | `pnpm --filter @paperclipai/ui test --run -- encodeWav` | ❌ Wave 0 |
+
+**Note:** UI tests use `// @vitest-environment jsdom` at the top of test files (see ChatInput.test.tsx pattern). All voice component tests must include this directive.
+
+### Sampling Rate
+- **Per task commit:** `pnpm --filter @paperclipai/ui test --run`
+- **Per wave merge:** `pnpm test --run`
+- **Phase gate:** Full suite green before `/gsd:verify-work`
+
+### Wave 0 Gaps
+- [ ] `ui/src/components/VoiceMicButton.test.tsx` — covers WCHAT-01
+- [ ] `ui/src/hooks/useVadRecorder.test.ts` — covers WCHAT-02
+- [ ] `ui/src/components/VoiceWaveform.test.tsx` — covers WCHAT-03
+- [ ] `ui/src/components/ChatVoicePlayer.test.tsx` — covers WCHAT-04
+- [ ] `ui/src/components/VoiceModeToggle.test.tsx` — covers WCHAT-05
+- [ ] `ui/src/hooks/useVoiceMode.test.ts` — covers WCHAT-06
+- [ ] `ui/src/lib/encodeWav.test.ts` — covers WAV encoding utility
+- [ ] Verify `server/src/routes/voice.ts` present (Phase 36 Task 3)
+- [ ] Verify `server/src/services/nexus-settings.ts` has voiceMode (Phase 36 Task 2)
+
+---
+
+## Sources
+
+### Primary (HIGH confidence)
+- npm registry — `@ricky0123/vad-react@0.0.36`, `@ricky0123/vad-web@0.0.30`, `onnxruntime-web@1.24.3` versions verified 2026-04-03
+- https://docs.vad.ricky0123.com/user-guide/api/ — useMicVAD API properties (listening, loading, errored, userSpeaking, start, pause, onSpeechEnd, baseAssetPath, onnxWASMBasePath)
+- MDN Web Audio API AnalyserNode documentation — waveform pattern
+- 37-UI-SPEC.md (committed in `a0103337`) — component inventory, interaction states, copywriting contract
+- 37-CONTEXT.md (committed in `30708d38`) — implementation decisions
+
+### Secondary (MEDIUM confidence)
+- Git history analysis (`fd372eaf`, `d0d7a23a`, `11508547`) — Phase 36 deliverable status
+- https://web.dev/articles/coop-coep — COOP/COEP header semantics
+- Vite docs — `server.headers` for dev server COOP/COEP
+
+### Tertiary (LOW confidence)
+- COEP `credentialless` alternative (open question #2) — browser support needs verification
+
+---
+
+## Metadata
+
+**Confidence breakdown:**
+- Standard stack: HIGH — npm registry confirmed versions; vad-react API verified from official docs
+- Architecture: HIGH — derived from 37-UI-SPEC.md (committed) + existing codebase patterns
+- Pitfalls: HIGH — based on verified browser behaviour (autoplay policy, COEP); LOW for pitfall #5 (threshold tuning is empirical)
+- Branch status: HIGH — verified via `git log --all --oneline` + `git show` of specific commits
+
+**Research date:** 2026-04-03
+**Valid until:** 2026-05-03 (stable APIs; vad-react hasn't released a major version since 2023)