Phase 37: Web Chat Voice UI - Research
Researched: 2026-04-03
Domain: Browser voice I/O (VAD, MediaRecorder, Web Audio API, waveform visualization, audio playback, COOP/COEP headers)
Confidence: HIGH
<user_constraints>
User Constraints (from CONTEXT.md)
Locked Decisions
All implementation choices are at Claude's discretion — discuss phase was skipped per user setting.
Claude's Discretion
All implementation details. Use ROADMAP phase goal, success criteria, and codebase conventions.
Key research findings baked into context:
- @ricky0123/vad-react ^0.0.36 for browser-side silence detection (VAD)
- COOP/COEP headers required on Express server for SharedArrayBuffer
- Waveform via Web Audio API AnalyserNode (Canvas or SVG, 30-50 data points)
- Native <audio> element + URL.createObjectURL() for playback
- Three-state voice mode: "text" | "voice_input" | "full_voice"
- VoiceMicButton replaces/enhances existing VoiceRecordButton
- Voice badge + expandable markdown section in ChatMessage
Deferred Ideas (OUT OF SCOPE)
None (discuss phase skipped).
</user_constraints>
<phase_requirements>
Phase Requirements
| ID | Description | Research Support |
|---|---|---|
| WCHAT-01 | Mic button in chat input starts/stops voice recording with visual state (idle/recording/processing) | VoiceMicButton replaces VoiceRecordButton; three-state via recording/userSpeaking/loading from useMicVAD |
| WCHAT-02 | Recording auto-stops on silence detection via VAD | useMicVAD onSpeechEnd callback fires automatically after 1.5s silence; no manual stop needed |
| WCHAT-03 | Real-time waveform/amplitude visualization displays while recording | VoiceWaveform canvas component using Web Audio API AnalyserNode + requestAnimationFrame |
| WCHAT-04 | Voice response audio plays inline in chat message with audio player controls | ChatVoicePlayer with native <audio> + URL.createObjectURL(); POST /api/synthesize → blob |
| WCHAT-05 | User can toggle voice mode: text only / voice input only / full voice (input + output) | VoiceModeToggle three-pill component; persists to nexus-settings voiceMode field |
| WCHAT-06 | Auto-play of voice responses is configurable (on/off in settings) | autoPlay flag in nexus-settings or localStorage; ChatVoicePlayer reads it on mount |
</phase_requirements>
Summary
Phase 37 adds browser-based voice I/O to the existing web chat. Phase 36 delivered the server-side pipeline (VoicePipelineService, POST /api/transcribe, POST /api/synthesize, voiceMode wiring in chat.ts) and the nexus-settings schema extension. Phase 37 is entirely a frontend phase with one server-side addition: COOP/COEP response headers on the Express static middleware.
The central library is @ricky0123/vad-react ^0.0.36, which wraps Silero VAD running in an AudioWorklet. It requires the page to be cross-origin isolated (COOP + COEP headers) to use SharedArrayBuffer. The package ships ONNX model files and a worklet bundle that must either be served locally from public/ or loaded from its default CDN URLs. The CDN default is simpler and acceptable for development; production should serve them locally.
Waveform visualization uses a standard Web Audio API AnalyserNode pattern: connect the microphone stream → AnalyserNode → read Uint8Array in requestAnimationFrame loop → render bars on a <canvas>. This is entirely in-browser with no extra library. Audio playback for synthesized responses uses the native <audio> HTML element with URL.createObjectURL() from a Blob received from POST /api/synthesize.
Primary recommendation: Install @ricky0123/vad-react, add COOP/COEP headers to Express static/vite-dev middleware, serve VAD assets from ui/public/, build five new components + two hooks as specified in 37-UI-SPEC.md, extend ChatInput + ChatMessage, wire voiceMode through useStreamingChat.
Branch Context (Critical)
The current worktree branch (gsd/phase-35-npx-buildthis-cli) has only Phase 36 Task 1 committed (VoicePipelineService). The remaining Phase 36 deliverables live on a separate branch not yet merged:
| Phase 36 Deliverable | Git Commit | Status in Current Branch |
|---|---|---|
| VoicePipelineService | 0ed912c2 | PRESENT |
| nexus-settings voiceMode schema | d0d7a23a | ABSENT (must be built in 37 Wave 0 or assumed present) |
| voiceMode in createMessageSchema | b964c0e4 | ABSENT |
| POST /api/transcribe, POST /api/synthesize routes | 11508547 | ABSENT |
| voiceMode wiring in chat.ts stream route | fd372eaf | ABSENT |
Implication for planning: Wave 0 of Phase 37 must either (a) merge/cherry-pick the Phase 36 remainder, or (b) re-implement the four absent deliverables listed above before building Phase 37 UI. The plan should treat Phase 36 tasks 2-3 as Wave 0 prerequisites and verify them before proceeding.
The ChatInput.tsx, ChatMessage.tsx, VoiceRecordButton.tsx, and related UI components exist on the parent branch PAP-878-create-a-mine-tab-in-inbox but NOT in the current worktree. The plan must account for these being the integration targets.
Standard Stack
Core
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| @ricky0123/vad-react | ^0.0.36 | Browser VAD with 1.5s silence auto-stop | Specified in 37-UI-SPEC.md; only mature browser-side VAD library |
| @ricky0123/vad-web | 0.0.30 (peer) | VAD engine (AudioWorklet + Silero ONNX) | Peer dep of vad-react |
| onnxruntime-web | ^1.17.0 (peer) | ONNX runtime for Silero model | Required by vad-web |
| Web Audio API | browser built-in | AnalyserNode for waveform bars | Zero bundle cost; already in browser |
| Native <audio> | browser built-in | Playback of synthesized WAV | No extra library needed |
Supporting
| Library | Version | Purpose | When to Use |
|---|---|---|---|
| lucide-react | ^0.574.0 (already in ui/) | Mic, Square, Loader2, Volume2, Play, Pause icons | Voice button states + audio player |
| shadcn/ui Badge | already installed | Voice badge on agent messages | ChatVoiceBadge component |
| shadcn/ui Collapsible | already installed | Expand/collapse full markdown in voice_full messages | ChatVoiceBadge expand section |
Alternatives Considered
| Instead of | Could Use | Tradeoff |
|---|---|---|
| @ricky0123/vad-react | Manual silence detection with AudioWorklet | Much more complex; vad-react is the de facto standard |
| Canvas waveform | SVG bars | Canvas performs better for 30fps animation |
| Native <audio> + blob URL | Howler.js | No extra dependency; native handles WAV fine |
Installation:
pnpm add @ricky0123/vad-react --filter @paperclipai/ui
Version verification (confirmed against npm registry 2026-04-03):
- @ricky0123/vad-react: 0.0.36 (latest)
- @ricky0123/vad-web: 0.0.30 (peer dependency, installed automatically)
- onnxruntime-web: 1.24.3 (latest; ^1.17.0 from vad-web is satisfied)
Architecture Patterns
Recommended Project Structure
ui/src/
├── components/
│ ├── VoiceMicButton.tsx # Replaces VoiceRecordButton — VAD + waveform + three states
│ ├── VoiceWaveform.tsx # Canvas amplitude bars (30-50 points, 32px tall)
│ ├── VoiceModeToggle.tsx # Three-pill: Text / Voice In / Full Voice
│ ├── ChatVoicePlayer.tsx # Inline audio player with play/pause/progress
│ └── ChatVoiceBadge.tsx # "Voice" badge + collapsible full markdown
├── hooks/
│ ├── useVadRecorder.ts # Wraps useMicVAD; exposes Float32Array on speech end
│ └── useVoiceMode.ts # Reads/writes voiceMode from nexus-settings
ui/public/
│ ├── vad.worklet.bundle.min.js # From @ricky0123/vad-web/dist/
│ ├── silero_vad_legacy.onnx # From @ricky0123/vad-web/dist/
│ └── silero_vad_v5.onnx # From @ricky0123/vad-web/dist/
server/src/
└── app.ts (add COOP/COEP headers middleware)
Pattern 1: useMicVAD from @ricky0123/vad-react
What: Hook that runs Silero VAD in an AudioWorklet; fires onSpeechEnd(audio: Float32Array) after silence
When to use: VoiceMicButton and useVadRecorder hook
// Source: https://docs.vad.ricky0123.com/user-guide/api/
import { useMicVAD } from "@ricky0123/vad-react";
const vad = useMicVAD({
startOnLoad: false, // user must click mic button first
onSpeechEnd: (audio: Float32Array) => {
// audio is Float32Array at 16kHz
// Convert to WAV blob and POST to /api/transcribe
},
onSpeechStart: () => { /* update waveform active state */ },
positiveSpeechThreshold: 0.8,
negativeSpeechThreshold: 0.8 - 0.15,
redemptionFrames: 8, // ~480ms silence before speech_end; raise this to reach the 1.5s auto-stop cited for WCHAT-02
baseAssetPath: "/", // serve from ui/public/
onnxWASMBasePath: "/",
});
// Returned: { listening, loading, errored, userSpeaking, start, pause }
Audio conversion for upload:
// Float32Array → WAV blob (16kHz, mono, 16-bit PCM)
function float32ToWav(samples: Float32Array, sampleRate = 16000): Blob {
const buffer = new ArrayBuffer(44 + samples.length * 2);
const view = new DataView(buffer);
// WAV header...
return new Blob([buffer], { type: "audio/wav" });
}
Pattern 2: Web Audio API AnalyserNode for waveform
What: Connect MediaStream to AnalyserNode; poll getByteFrequencyData in rAF loop
When to use: VoiceWaveform component (only while recording)
// Source: MDN Web Audio API docs
const audioCtx = new AudioContext();
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 64; // 32 frequency bins
const source = audioCtx.createMediaStreamSource(stream);
source.connect(analyser);
const dataArray = new Uint8Array(analyser.frequencyBinCount); // 32 bars
function draw() {
animRef.current = requestAnimationFrame(draw);
analyser.getByteFrequencyData(dataArray);
// render bars to canvas
}
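The "render bars to canvas" step needs the 32 frequency bins reduced to however many bars the component draws. A minimal sketch of that reduction follows; `toBarHeights` is a hypothetical helper name, and averaging adjacent bins is one reasonable choice rather than something 37-UI-SPEC.md prescribes.

```typescript
// Hypothetical helper (not part of any library): averages contiguous slices of
// AnalyserNode byte data (0-255) into `barCount` heights scaled to `maxHeight` pixels.
function toBarHeights(data: Uint8Array, barCount: number, maxHeight: number): number[] {
  if (barCount <= 0 || data.length === 0) return [];
  const heights: number[] = [];
  for (let bar = 0; bar < barCount; bar++) {
    // Slice boundaries; every bar covers at least one bin.
    const start = Math.floor((bar * data.length) / barCount);
    const end = Math.max(start + 1, Math.floor(((bar + 1) * data.length) / barCount));
    let sum = 0;
    for (let i = start; i < end; i++) sum += data[i] ?? 0;
    const average = sum / (end - start);
    heights.push(Math.round((average / 255) * maxHeight));
  }
  return heights;
}
```

Inside `draw()`, something like `toBarHeights(dataArray, 32, 32)` would give one height per bar for a 32px-tall canvas.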
Pattern 3: COOP/COEP headers in Express
What: Cross-origin isolation required for SharedArrayBuffer (used by AudioWorklet/ONNX)
When to use: All static responses and Vite dev server
// Source: MDN - Cross-Origin Isolation
// In server/src/app.ts, before static/vite middleware:
app.use((_req, res, next) => {
res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
next();
});
// For Vite dev in vite.config.ts:
server: {
headers: {
"Cross-Origin-Opener-Policy": "same-origin",
"Cross-Origin-Embedder-Policy": "require-corp",
},
},
Critical: COEP require-corp means every cross-origin resource must opt in with a CORP header (or be fetched via CORS). This applies to CDN-hosted VAD assets as well as externally hosted images: they are cross-origin and load only if the CDN sends the required headers. Serve VAD assets from ui/public/ (same-origin) to avoid CORP issues entirely.
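Whether the headers actually took effect can be checked at runtime through the standard `crossOriginIsolated` global. The sketch below wraps that check in a hypothetical `isolationStatus` helper that takes the global scope as an argument so it can be exercised outside a browser; the status strings are illustrative.

```typescript
// Minimal shape of the globals this check reads; in a browser, pass globalThis.
interface IsolationScope {
  crossOriginIsolated?: boolean;
  SharedArrayBuffer?: unknown;
}

// Hypothetical helper: reports why SharedArrayBuffer may be unavailable.
function isolationStatus(scope: IsolationScope): "ready" | "not-isolated" | "no-sab" {
  // crossOriginIsolated is true only when both COOP and COEP were accepted.
  if (scope.crossOriginIsolated !== true) return "not-isolated";
  if (typeof scope.SharedArrayBuffer !== "function") return "no-sab";
  return "ready";
}
```

Calling this before `vad.start()` and logging anything other than `"ready"` would make a missing or overridden header visible early, instead of surfacing as a stalled VAD load.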
Pattern 4: VAD asset setup (Vite)
What: Copy ONNX + worklet files to public/ so they are served at root
When to use: Build setup task
# After pnpm install, copy from node_modules:
cp node_modules/@ricky0123/vad-web/dist/vad.worklet.bundle.min.js ui/public/
cp node_modules/@ricky0123/vad-web/dist/silero_vad_legacy.onnx ui/public/
cp node_modules/@ricky0123/vad-web/dist/silero_vad_v5.onnx ui/public/
Alternatively, use vite-plugin-static-copy, or add a copy script to ui/package.json and invoke it from prepare:
"scripts": {
"copy-vad-assets": "cp node_modules/@ricky0123/vad-web/dist/vad.worklet.bundle.min.js public/ && cp node_modules/@ricky0123/vad-web/dist/*.onnx public/"
}
Pattern 5: useVoiceMode hook
What: Reads voiceMode from GET /api/nexus-settings, writes via PATCH
When to use: VoiceModeToggle component; ChatPanel to pass voiceMode to stream call
// Source: existing nexus-settings pattern in codebase
type VoiceMode = "text" | "voice_input" | "full_voice";
export function useVoiceMode() {
const [mode, setMode] = useState<VoiceMode>("text");
// Load on mount via GET /api/nexus-settings
// PATCH on change
return { mode, setMode: async (next: VoiceMode) => { ... } };
}
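Because the mode arrives as untyped JSON from GET /api/nexus-settings, the hook benefits from validating it before calling `setMode`. A small sketch of that validation, assuming the response is an object with an optional `voiceMode` field; `isVoiceMode` and `readVoiceMode` are hypothetical names.

```typescript
type VoiceMode = "text" | "voice_input" | "full_voice";

const VOICE_MODES: readonly string[] = ["text", "voice_input", "full_voice"];

// Type guard: narrows arbitrary JSON values to the three known modes.
function isVoiceMode(value: unknown): value is VoiceMode {
  return typeof value === "string" && VOICE_MODES.includes(value);
}

// Hypothetical reader: falls back to "text" when the field is missing or invalid.
function readVoiceMode(settings: unknown): VoiceMode {
  if (typeof settings === "object" && settings !== null) {
    const candidate = (settings as { voiceMode?: unknown }).voiceMode;
    if (isVoiceMode(candidate)) return candidate;
  }
  return "text";
}
```

Defaulting to `"text"` mirrors the hook's initial state, so a settings payload that predates the voiceMode schema behaves like a fresh install.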
Pattern 6: Float32Array → WAV Blob
What: Convert vad-react onSpeechEnd Float32Array (16kHz) to WAV for upload
When to use: useVadRecorder.ts, before POSTing to /api/transcribe
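// Source: standard WAV encoding algorithm (verified against multiple sources)
// Writes an ASCII tag such as "RIFF" byte by byte at the given offset
function writeString(view: DataView, offset: number, text: string): void {
for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
}
function encodeWav(samples: Float32Array, sampleRate = 16000): Blob {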
const numSamples = samples.length;
const buffer = new ArrayBuffer(44 + numSamples * 2);
const view = new DataView(buffer);
// RIFF chunk
writeString(view, 0, "RIFF");
view.setUint32(4, 36 + numSamples * 2, true);
writeString(view, 8, "WAVE");
// fmt sub-chunk
writeString(view, 12, "fmt ");
view.setUint32(16, 16, true); // fmt chunk size (16 bytes for PCM)
view.setUint16(20, 1, true); // PCM = 1
view.setUint16(22, 1, true); // mono
view.setUint32(24, sampleRate, true);
view.setUint32(28, sampleRate * 2, true); // byte rate
view.setUint16(32, 2, true); // block align
view.setUint16(34, 16, true); // bits per sample
// data sub-chunk
writeString(view, 36, "data");
view.setUint32(40, numSamples * 2, true);
let offset = 44;
for (let i = 0; i < numSamples; i++) {
const s = Math.max(-1, Math.min(1, samples[i]));
view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
offset += 2;
}
return new Blob([buffer], { type: "audio/wav" });
}
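For the planned encodeWav unit test, it helps to read the header back and compare fields rather than compare raw bytes. The sketch below is a hypothetical `parseWavHeader` that assumes the canonical 44-byte PCM layout written above (fmt chunk at offset 12, data chunk at offset 36); it is not a general WAV parser.

```typescript
interface WavHeader {
  riff: string;
  wave: string;
  audioFormat: number;
  channels: number;
  sampleRate: number;
  byteRate: number;
  blockAlign: number;
  bitsPerSample: number;
  dataBytes: number;
}

// Reads a 4-character ASCII tag such as "RIFF".
function readTag(view: DataView, offset: number): string {
  let tag = "";
  for (let i = 0; i < 4; i++) tag += String.fromCharCode(view.getUint8(offset + i));
  return tag;
}

// Hypothetical test helper: decodes the fixed-position fields of a 44-byte PCM header.
function parseWavHeader(buffer: ArrayBuffer): WavHeader {
  if (buffer.byteLength < 44) throw new Error("WAV buffer shorter than the 44-byte header");
  const view = new DataView(buffer);
  return {
    riff: readTag(view, 0),
    wave: readTag(view, 8),
    audioFormat: view.getUint16(20, true),
    channels: view.getUint16(22, true),
    sampleRate: view.getUint32(24, true),
    byteRate: view.getUint32(28, true),
    blockAlign: view.getUint16(32, true),
    bitsPerSample: view.getUint16(34, true),
    dataBytes: view.getUint32(40, true),
  };
}
```

In a test, `parseWavHeader(await blob.arrayBuffer())` on the encoder output should report `sampleRate` 16000, `channels` 1, and `dataBytes` equal to twice the sample count.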
Pattern 7: POST /api/synthesize + playback
What: Send text to synthesis endpoint, receive WAV buffer, play with native audio
When to use: ChatVoicePlayer when messageType is voice_full
async function playVoiceResponse(text: string, autoPlay: boolean) {
const res = await fetch("/api/synthesize", {
method: "POST",
headers: { "Content-Type": "application/json" },
credentials: "include",
body: JSON.stringify({ text }),
});
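// Fail early instead of trying to play an error response body as audio
if (!res.ok) throw new Error(`Synthesis failed with status ${res.status}`);
const blob = await res.blob();
const url = URL.createObjectURL(blob);
const audio = new Audio(url);
// play() returns a promise that rejects when the browser blocks autoplay
if (autoPlay) audio.play().catch(() => { /* leave playback to the manual controls */ });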
// expose pause/play controls; revoke URL on ended
audio.addEventListener("ended", () => URL.revokeObjectURL(url));
}
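The `autoPlay` argument has to come from somewhere. The sketch below assumes the localStorage option with the `nexus:voice:autoplay` key proposed under Open Questions and stores the literal strings "true" and "false" (the stored format is an assumption). `KeyValueStore`, `readAutoPlay`, and `writeAutoPlay` are hypothetical names; the store is passed in so the logic runs without a browser.

```typescript
// Subset of the Web Storage interface used here; window.localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const AUTOPLAY_KEY = "nexus:voice:autoplay";

// Returns the stored preference, or `fallback` when nothing valid is stored.
function readAutoPlay(store: KeyValueStore, fallback = false): boolean {
  const raw = store.getItem(AUTOPLAY_KEY);
  if (raw === "true") return true;
  if (raw === "false") return false;
  return fallback;
}

// Persists the preference as the string "true" or "false".
function writeAutoPlay(store: KeyValueStore, enabled: boolean): void {
  store.setItem(AUTOPLAY_KEY, enabled ? "true" : "false");
}
```

ChatVoicePlayer could then call `playVoiceResponse(text, readAutoPlay(window.localStorage))` on mount.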
Anti-Patterns to Avoid
- Calling useMicVAD with startOnLoad: true: Triggers immediate mic permission prompt on page load, not on user gesture. Always use startOnLoad: false and call vad.start() on mic button click.
- Using AudioContext before user gesture: Browsers require AudioContext creation/resume inside a user interaction. Create it lazily in the click handler, not on component mount.
- Serving VAD assets from CDN with COEP require-corp: CDN resources lack CORP headers. Will cause COEP fetch errors. Always copy to ui/public/ and use baseAssetPath: "/".
- Not revoking blob URLs: URL.createObjectURL() leaks memory if URLs are not revoked after use.
- POSTing Float32Array directly to /api/transcribe: The transcribe endpoint expects audio/webm or audio/wav multipart upload. Must encode Float32Array to WAV first.
Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| Silence detection | Custom silence timer with AudioWorklet | @ricky0123/vad-react | Silero VAD model; handles background noise, breath, plosives; 37 published versions |
| WAV encoding | Custom encoder | 44-line standard WAV encoder (see Pattern 6) | Not complex enough for a library; standard algorithm |
| Audio playback | Custom audio element abstraction | Native <audio> + URL.createObjectURL() | Browser handles all codec/format negotiation |
| ONNX inference | Build ONNX runner | onnxruntime-web (peer dep of vad-web) | Already bundled |
Common Pitfalls
Pitfall 1: COEP blocks CDN asset loading
What goes wrong: After adding Cross-Origin-Embedder-Policy: require-corp, all cross-origin resources (Google Fonts, avatars from external URLs, CDN assets) are blocked unless they send Cross-Origin-Resource-Policy: cross-origin. Existing chat images from /api/assets/ (same-origin) are fine, but any externally hosted content breaks.
Why it happens: COEP require-corp enforces CORP on all sub-resources.
How to avoid: Serve all VAD ONNX/worklet assets from ui/public/ (same-origin). Audit for any cross-origin resource loads in existing chat components before adding headers.
Warning signs: Console errors: "COEP blocked cross-origin resource" for non-audio assets.
Pitfall 2: AudioContext suspended due to autoplay policy
What goes wrong: AudioContext.state === "suspended" prevents AnalyserNode from producing data; waveform is all zeros.
Why it happens: Browsers require AudioContext to be created or resumed inside a user gesture (click/tap).
How to avoid: Create new AudioContext() lazily inside the mic button click handler. If the context exists but is suspended, call context.resume() before starting recording.
Warning signs: Waveform canvas renders but all bars are flat (zero amplitude).
Pitfall 3: VAD model files not found
What goes wrong: useMicVAD throws or hangs with loading: true indefinitely; console shows 404 for .onnx or .worklet.bundle.min.js.
Why it happens: Default baseAssetPath may point to CDN; if COEP is active, CDN fetch is blocked. Or files were not copied to ui/public/.
How to avoid: Explicitly set baseAssetPath: "/" and onnxWASMBasePath: "/" in useMicVAD options. Verify files exist at ui/public/vad.worklet.bundle.min.js, ui/public/silero_vad_legacy.onnx, ui/public/silero_vad_v5.onnx after install.
Warning signs: vad.loading === true for more than 3 seconds; 404s in network tab.
Pitfall 4: voiceMode not passed through useStreamingChat
What goes wrong: Sending a voice_input message, the server doesn't set messageType: "voice_input" on the stored message, so ChatVoiceBadge never renders.
Why it happens: The current useStreamingChat.startStream() signature is (userMessage: string, agentId?: string), with no voiceMode parameter. chat.ts only sets messageType when voiceMode is in the request body.
How to avoid: Extend useStreamingChat.startStream() to accept voiceMode?: string and pass it in the fetch body to /api/conversations/${id}/stream.
Warning signs: Voice messages render as plain user text without the voice badge.
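One way to keep the extended startStream() easy to test is to build the request in a pure function. The sketch below is illustrative only: the URL pattern and the optional `voiceMode?: string` parameter come from the text above, while the body field names `message` and `agentId` and the helper name `buildStreamRequest` are assumptions that must be matched to what chat.ts actually parses.

```typescript
interface StreamRequest {
  url: string;
  init: {
    method: "POST";
    headers: Record<string, string>;
    credentials: "include";
    body: string;
  };
}

// Hypothetical helper: includes voiceMode in the body only when the caller supplies it.
function buildStreamRequest(
  conversationId: string,
  userMessage: string,
  agentId?: string,
  voiceMode?: string,
): StreamRequest {
  // Field names below are assumed, not taken from the server schema.
  const payload: Record<string, string> = { message: userMessage };
  if (agentId) payload["agentId"] = agentId;
  if (voiceMode) payload["voiceMode"] = voiceMode;
  return {
    url: `/api/conversations/${encodeURIComponent(conversationId)}/stream`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      credentials: "include",
      body: JSON.stringify(payload),
    },
  };
}
```

A unit test can then assert that `voiceMode` is present in the serialized body for voice messages and absent for plain text ones, which is exactly the regression this pitfall describes.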
Pitfall 5: onSpeechEnd fires on very short utterances
What goes wrong: Background noise triggers onSpeechEnd with very short audio that produces garbage transcription.
Why it happens: VAD fires even for brief sounds if positiveSpeechThreshold is too low.
How to avoid: Set minSpeechFrames: 3 (minimum ~180ms) and positiveSpeechThreshold: 0.8 to filter noise. Display a "Too short" toast if the returned text is empty or < 2 chars.
Warning signs: Empty transcriptions appearing in chat; fast repeated submissions.
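A client-side guard can complement the VAD thresholds by dropping clips before upload and rejecting near-empty transcripts afterwards. A minimal sketch follows; the helper names and the 300ms default are illustrative, while the 16kHz rate and the "fewer than 2 characters" rule come from the text above.

```typescript
// Duration of a mono clip in milliseconds.
function utteranceMs(samples: Float32Array, sampleRate = 16000): number {
  return (samples.length / sampleRate) * 1000;
}

// Hypothetical pre-upload gate: skip clips shorter than `minMs`.
function shouldTranscribe(samples: Float32Array, minMs = 300, sampleRate = 16000): boolean {
  return utteranceMs(samples, sampleRate) >= minMs;
}

// Hypothetical post-transcription gate: empty or single-character text counts as "too short".
function isUsableTranscript(text: string | null | undefined): boolean {
  return (text ?? "").trim().length >= 2;
}
```

In `onSpeechEnd`, returning early when `shouldTranscribe(audio)` is false avoids a round trip to /api/transcribe, and `isUsableTranscript(text)` decides between calling `onTranscript` and showing the "Too short" toast.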
Pitfall 6: Phase 36 deliverables not present in working branch
What goes wrong: Building ChatInput voice integration before server/src/routes/voice.ts, server/src/services/nexus-settings.ts voiceMode schema, and voiceMode in createMessageSchema are present causes compile errors and missing endpoints.
Why it happens: Only Phase 36 Task 1 (VoicePipelineService) is on the current branch.
How to avoid: Wave 0 must cherry-pick or re-implement Phase 36 Tasks 2-3 commits before any Phase 37 implementation work. Verify that POST /api/transcribe and POST /api/synthesize are registered and respond successfully before proceeding.
Code Examples
VoiceMicButton state machine
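// Source: 37-UI-SPEC.md + useMicVAD API docs
// Import paths for Button, VoiceWaveform, and encodeWav are assumed from the project structure above
import { useState } from "react";
import { useMicVAD } from "@ricky0123/vad-react";
import { Loader2, Mic } from "lucide-react";
import { Button } from "@/components/ui/button";
import { VoiceWaveform } from "./VoiceWaveform";
import { encodeWav } from "@/lib/encodeWav";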
type RecordState = "idle" | "recording" | "processing";
function VoiceMicButton({ onTranscript }: { onTranscript: (text: string) => void }) {
const [state, setState] = useState<RecordState>("idle");
const vad = useMicVAD({
startOnLoad: false,
baseAssetPath: "/",
onnxWASMBasePath: "/",
onSpeechEnd: async (audio: Float32Array) => {
vad.pause();
setState("processing");
try {
const wav = encodeWav(audio);
const form = new FormData();
form.append("audio", wav, "recording.wav");
const res = await fetch("/api/transcribe", {
method: "POST", credentials: "include", body: form,
});
if (!res.ok) throw new Error(`Transcription failed with status ${res.status}`);
const { text } = await res.json() as { text: string };
if (text?.trim()) onTranscript(text.trim());
} catch (err) {
console.error(err);
} finally {
// Always leave "processing", even when the request fails
setState("idle");
}
},
});
const handleClick = () => {
if (state === "idle") { vad.start(); setState("recording"); }
else if (state === "recording") { vad.pause(); setState("idle"); }
};
if (state === "processing") return <Button disabled><Loader2 className="h-4 w-4 animate-spin" /></Button>;
if (state === "recording") return (
<Button className="ring-2 ring-primary" onClick={handleClick} aria-label="Recording — speak now">
<VoiceWaveform listening={vad.listening} />
</Button>
);
return <Button onClick={handleClick} aria-label="Start voice input"><Mic className="h-4 w-4" /></Button>;
}
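The transitions in the component above can also be written as a pure function, which makes the WCHAT-01 state test independent of React and of useMicVAD. This is a sketch only; the event names are hypothetical and `nextRecordState` is not part of 37-UI-SPEC.md.

```typescript
type RecordState = "idle" | "recording" | "processing";
// Hypothetical event names covering the mic click, VAD speech end, and request outcome.
type RecordEvent = "click" | "speechEnd" | "transcribed" | "failed";

function nextRecordState(state: RecordState, event: RecordEvent): RecordState {
  switch (state) {
    case "idle":
      // Only a click starts recording.
      return event === "click" ? "recording" : "idle";
    case "recording":
      if (event === "click") return "idle"; // manual stop
      if (event === "speechEnd") return "processing"; // VAD detected silence
      return "recording";
    case "processing":
      // Clicks are ignored while the button is disabled.
      return event === "transcribed" || event === "failed" ? "idle" : "processing";
  }
}
```

Treating `failed` as a transition back to `idle` encodes the rule that a transcription error must never leave the button stuck in its spinner state.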
ChatVoiceBadge (voice_full expand/collapse)
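// Source: 37-UI-SPEC.md; uses shadcn Collapsible (already installed)
// MarkdownBody is assumed to be the codebase's existing markdown renderer; import it from its current location
import { useState } from "react";
import { Collapsible, CollapsibleContent, CollapsibleTrigger } from "@/components/ui/collapsible";
import { Badge } from "@/components/ui/badge";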
function ChatVoiceBadge({ content, messageType }: { content: string; messageType: string }) {
const [open, setOpen] = useState(false);
const spokenMatch = content.match(/SPOKEN:\s*([\s\S]*?)(?=\nDETAILED:|$)/);
const spokenText = spokenMatch?.[1]?.trim() ?? content;
const detailedMatch = content.match(/DETAILED:\s*([\s\S]*)/);
return (
<div>
<Badge variant="outline" className="text-xs mb-2">Voice</Badge>
<p className="text-sm">{spokenText}</p>
{messageType === "voice_full" && detailedMatch && (
<Collapsible open={open} onOpenChange={setOpen}>
<CollapsibleTrigger className="text-xs text-muted-foreground hover:text-foreground mt-1">
{open ? "Hide full response" : "Show full response"}
</CollapsibleTrigger>
<CollapsibleContent>
<MarkdownBody className="text-sm mt-2">{detailedMatch[1].trim()}</MarkdownBody>
</CollapsibleContent>
</Collapsible>
)}
</div>
);
}
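The SPOKEN/DETAILED parsing can be pulled out of the component so it is testable without rendering. The sketch below reuses the two regular expressions from the component unchanged; `splitVoiceContent` is a hypothetical name.

```typescript
interface VoiceContent {
  spoken: string;
  detailed: string | null;
}

// Splits a voice_full message body into its spoken summary and optional detailed markdown.
function splitVoiceContent(content: string): VoiceContent {
  const spokenMatch = content.match(/SPOKEN:\s*([\s\S]*?)(?=\nDETAILED:|$)/);
  const detailedMatch = content.match(/DETAILED:\s*([\s\S]*)/);
  return {
    // Messages without markers are shown as-is.
    spoken: spokenMatch?.[1]?.trim() ?? content,
    detailed: detailedMatch?.[1]?.trim() ?? null,
  };
}
```

For example, `splitVoiceContent("SPOKEN: Done.\nDETAILED: ## Steps")` yields `spoken` "Done." and `detailed` "## Steps", while a plain message comes back with `detailed` set to `null`.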
State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| WebRTC VAD polyfill | Silero VAD via ONNX + AudioWorklet | 2023-2024 | Dramatically better accuracy; handles noisy environments |
| MediaRecorder → manual silence timer | @ricky0123/vad-react onSpeechEnd | 2023 | Eliminates timer tuning; model-based accuracy |
| Flash/plugin audio playback | Native <audio> + Web Audio API | 2015+ | Universal; no plugin required |
| Custom waveform libraries | Web Audio API AnalyserNode | Always | Zero dependency; 30fps canvas |
Deprecated/outdated:
- annyang, artyom.js: Web Speech API wrappers; browser-only, privacy concerns, no offline support
- Manual silence detection with onaudioprocess: Deprecated ScriptProcessor API; replaced by AudioWorklet
- MediaRecorder direct upload (VoiceRecordButton v1): Manual stop only, no auto-silence; replaced by useVadRecorder
Open Questions
1. autoPlay persistence: nexus-settings vs localStorage
   - What we know: nexus-settings already has voiceMode field. autoPlay (WCHAT-06) is a separate user preference.
   - What's unclear: Should autoPlay live in nexus-settings (persisted server-side, works across devices) or localStorage (client-only, simpler)?
   - Recommendation: Use localStorage key nexus:voice:autoplay for autoPlay. It is a per-device UX preference that doesn't need server-side persistence, which keeps nexus-settings lean.
2. COEP impact on existing cross-origin resources
   - What we know: COEP require-corp blocks cross-origin resources without CORP header.
   - What's unclear: Do existing Chat UI components load any cross-origin images (avatar CDN, external URLs in messages)?
   - Recommendation: Audit ui/src/components/ChatMessage.tsx and Identity.tsx for external image src. If any exist, use credentialless instead of require-corp for COEP; this relaxes the restriction while still enabling SharedArrayBuffer in Chromium 96+. MEDIUM confidence: Firefox may not support credentialless mode.
3. VAD false-positive rate in quiet environments
   - What we know: Silero VAD default thresholds are tuned for speech.
   - What's unclear: In near-silent environments, keyboard noise or mouse clicks may trigger onSpeechEnd.
   - Recommendation: Use minSpeechFrames: 5 (300ms minimum) as a safety gate. Show "Too short, try again" toast if transcript is empty.
Environment Availability
| Dependency | Required By | Available | Version | Fallback |
|---|---|---|---|---|
| Node.js | build + tests | ✓ | v20.20.2 | — |
| pnpm | package install | ✓ | 9.15.4 | — |
| @ricky0123/vad-react | WCHAT-02 | ✗ (not installed) | — | Must install via pnpm |
| @ricky0123/vad-web | peer of vad-react | ✗ (not installed) | — | Installed automatically |
| onnxruntime-web | peer of vad-web | ✗ (not installed) | — | Installed automatically |
| Phase 36 Task 2-3 deliverables | All voice routes | ✗ (not on branch) | — | Wave 0 must cherry-pick or re-implement |
Missing dependencies with no fallback:
- @ricky0123/vad-react: must be installed (pnpm add @ricky0123/vad-react --filter @paperclipai/ui)
- Phase 36 server-side deliverables: POST /api/transcribe, POST /api/synthesize, nexus-settings voiceMode
Missing dependencies with fallback:
- None
Validation Architecture
Test Framework
| Property | Value |
|---|---|
| Framework | vitest ^3.0.5 |
| Config file | ui/vitest.config.ts |
| Quick run command | pnpm --filter @paperclipai/ui test --run |
| Full suite command | pnpm test --run |
Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|---|---|---|---|---|
| WCHAT-01 | VoiceMicButton renders idle/recording/processing states | unit | pnpm --filter @paperclipai/ui test --run -- VoiceMicButton | ❌ Wave 0 |
| WCHAT-02 | useVadRecorder calls onTranscript after onSpeechEnd fires | unit | pnpm --filter @paperclipai/ui test --run -- useVadRecorder | ❌ Wave 0 |
| WCHAT-03 | VoiceWaveform renders canvas with correct dimensions | unit | pnpm --filter @paperclipai/ui test --run -- VoiceWaveform | ❌ Wave 0 |
| WCHAT-04 | ChatVoicePlayer renders play button; auto-plays when autoPlay=true | unit | pnpm --filter @paperclipai/ui test --run -- ChatVoicePlayer | ❌ Wave 0 |
| WCHAT-05 | VoiceModeToggle renders three pills; click updates mode | unit | pnpm --filter @paperclipai/ui test --run -- VoiceModeToggle | ❌ Wave 0 |
| WCHAT-06 | useVoiceMode persists mode to nexus-settings; loads on mount | unit | pnpm --filter @paperclipai/ui test --run -- useVoiceMode | ❌ Wave 0 |
| WCHAT-01,02 | POST /api/transcribe returns { text } for WAV upload | unit (server) | pnpm --filter @paperclipai/server test --run -- voice-routes | ❌ (Phase 36 Task 3, verify present) |
| WCHAT-04 | POST /api/synthesize returns audio/wav for text input | unit (server) | pnpm --filter @paperclipai/server test --run -- voice-routes | ❌ (Phase 36 Task 3, verify present) |
| WCHAT-03 | encodeWav produces valid 44-byte WAV header | unit | pnpm --filter @paperclipai/ui test --run -- encodeWav | ❌ Wave 0 |
Note: UI tests use // @vitest-environment jsdom at the top of test files (see ChatInput.test.tsx pattern). All voice component tests must include this directive.
Sampling Rate
- Per task commit: pnpm --filter @paperclipai/ui test --run
- Per wave merge: pnpm test --run
- Phase gate: Full suite green before /gsd:verify-work
Wave 0 Gaps
- ui/src/components/VoiceMicButton.test.tsx: covers WCHAT-01
- ui/src/hooks/useVadRecorder.test.ts: covers WCHAT-02
- ui/src/components/VoiceWaveform.test.tsx: covers WCHAT-03
- ui/src/components/ChatVoicePlayer.test.tsx: covers WCHAT-04
- ui/src/components/VoiceModeToggle.test.tsx: covers WCHAT-05
- ui/src/hooks/useVoiceMode.test.ts: covers WCHAT-06
- ui/src/lib/encodeWav.test.ts: covers WAV encoding utility
- Verify server/src/routes/voice.ts present (Phase 36 Task 3)
- Verify server/src/services/nexus-settings.ts has voiceMode (Phase 36 Task 2)
Sources
Primary (HIGH confidence)
- npm registry: @ricky0123/vad-react@0.0.36, @ricky0123/vad-web@0.0.30, onnxruntime-web@1.24.3 versions verified 2026-04-03
- https://docs.vad.ricky0123.com/user-guide/api/ : useMicVAD API properties (listening, loading, errored, userSpeaking, start, pause, onSpeechEnd, baseAssetPath, onnxWASMBasePath)
- MDN Web Audio API AnalyserNode documentation: waveform pattern
- 37-UI-SPEC.md (committed in a0103337): component inventory, interaction states, copywriting contract
- 37-CONTEXT.md (committed in 30708d38): implementation decisions
Secondary (MEDIUM confidence)
- Git history analysis (fd372eaf, d0d7a23a, 11508547): Phase 36 deliverable status
- https://web.dev/articles/coop-coep : COOP/COEP header semantics
- Vite docs: server.headers for dev server COOP/COEP
Tertiary (LOW confidence)
- COEP credentialless alternative (open question #2): browser support needs verification
Metadata
Confidence breakdown:
- Standard stack: HIGH. npm registry confirmed versions; vad-react API verified from official docs
- Architecture: HIGH. Derived from 37-UI-SPEC.md (committed) + existing codebase patterns
- Pitfalls: HIGH. Based on verified browser behaviour (autoplay policy, COEP); LOW for pitfall #5 (threshold tuning is empirical)
- Branch status: HIGH. Verified via git log --all --oneline + git show of specific commits
Research date: 2026-04-03
Valid until: 2026-05-03 (stable APIs; vad-react hasn't released a major version since 2023)