Nexus Dev d4caf8f0da docs(37): phase research — VAD, COOP/COEP, component architecture

2026-04-04 02:07:19 +00:00

30 KiB


Phase 37: Web Chat Voice UI - Research

Researched: 2026-04-03
Domain: Browser voice I/O (VAD, MediaRecorder, Web Audio API, waveform visualization, audio playback, COOP/COEP headers)
Confidence: HIGH

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

All implementation choices are at Claude's discretion — discuss phase was skipped per user setting.

Claude's Discretion

All implementation details. Use ROADMAP phase goal, success criteria, and codebase conventions.

Key research findings baked into context:

@ricky0123/vad-react ^0.0.36 for browser-side silence detection (VAD)
COOP/COEP headers required on Express server for SharedArrayBuffer
Waveform via Web Audio API AnalyserNode (Canvas or SVG, 30-50 data points)
Native <audio> element + URL.createObjectURL() for playback
Three-state voice mode: "text" | "voice_input" | "full_voice"
VoiceMicButton replaces/enhances existing VoiceRecordButton
Voice badge + expandable markdown section in ChatMessage

Deferred Ideas (OUT OF SCOPE)

None — discuss phase skipped. </user_constraints>

<phase_requirements>

Phase Requirements

ID	Description	Research Support
WCHAT-01	Mic button in chat input starts/stops voice recording with visual state (idle/recording/processing)	VoiceMicButton replaces VoiceRecordButton; three-state via listening/userSpeaking/loading from useMicVAD
WCHAT-02	Recording auto-stops on silence detection via VAD	useMicVAD onSpeechEnd callback fires automatically once the configured silence window elapses (target: 1.5s); no manual stop needed
WCHAT-03	Real-time waveform/amplitude visualization displays while recording	VoiceWaveform canvas component using Web Audio API AnalyserNode + requestAnimationFrame
WCHAT-04	Voice response audio plays inline in chat message with audio player controls	ChatVoicePlayer with native `<audio>` + URL.createObjectURL(); POST /api/synthesize → blob
WCHAT-05	User can toggle voice mode: text only / voice input only / full voice (input + output)	VoiceModeToggle three-pill component; persists to nexus-settings voiceMode field
WCHAT-06	Auto-play of voice responses is configurable (on/off in settings)	autoPlay flag in nexus-settings or localStorage; ChatVoicePlayer reads it on mount
</phase_requirements>

Summary

Phase 37 adds browser-based voice I/O to the existing web chat. Phase 36 delivered the server-side pipeline (VoicePipelineService, POST /api/transcribe, POST /api/synthesize, voiceMode wiring in chat.ts) and the nexus-settings schema extension. Phase 37 is entirely a frontend phase with one server-side addition: COOP/COEP response headers on the Express static middleware.

The central library is @ricky0123/vad-react ^0.0.36, which wraps Silero VAD running in an AudioWorklet. It requires the page to be cross-origin isolated (COOP + COEP headers) to use SharedArrayBuffer. The package ships ONNX model files and a worklet bundle that must either be served locally from public/ or loaded from its default CDN URLs. The CDN default is simpler, but under COEP require-corp it works only if the CDN responses carry a CORP or CORS opt-in, so this phase serves the assets locally from ui/public/.

Waveform visualization uses a standard Web Audio API AnalyserNode pattern: connect the microphone stream → AnalyserNode → read Uint8Array in requestAnimationFrame loop → render bars on a <canvas>. This is entirely in-browser with no extra library. Audio playback for synthesized responses uses the native <audio> HTML element with URL.createObjectURL() from a Blob received from POST /api/synthesize.

Primary recommendation: Install @ricky0123/vad-react, add COOP/COEP headers to Express static/vite-dev middleware, serve VAD assets from ui/public/, build five new components + two hooks as specified in 37-UI-SPEC.md, extend ChatInput + ChatMessage, wire voiceMode through useStreamingChat.

Branch Context (Critical)

The current worktree branch (gsd/phase-35-npx-buildthis-cli) has only Phase 36 Task 1 committed (VoicePipelineService). The remaining Phase 36 deliverables live on a separate branch not yet merged:

Phase 36 Deliverable	Git Commit	Status in Current Branch
VoicePipelineService	`0ed912c2`	PRESENT
nexus-settings voiceMode schema	`d0d7a23a`	ABSENT — must be built in 37 Wave 0 or assumed present
voiceMode in createMessageSchema	`b964c0e4`	ABSENT
POST /api/transcribe, POST /api/synthesize routes	`11508547`	ABSENT
voiceMode wiring in chat.ts stream route	`fd372eaf`	ABSENT

Implication for planning: Wave 0 of Phase 37 must either (a) merge/cherry-pick the Phase 36 remainder, or (b) re-implement the four ABSENT deliverables listed above before building Phase 37 UI. The plan should treat Phase 36 tasks 2-3 as Wave 0 prerequisites and verify them before proceeding.

The ChatInput.tsx, ChatMessage.tsx, VoiceRecordButton.tsx, and related UI components exist on the parent branch PAP-878-create-a-mine-tab-in-inbox but NOT in the current worktree. The plan must account for these being the integration targets.

Standard Stack

Core

Library	Version	Purpose	Why Standard
`@ricky0123/vad-react`	`^0.0.36`	Browser VAD with 1.5s silence auto-stop	Specified in 37-UI-SPEC.md; only mature browser-side VAD library
`@ricky0123/vad-web`	`0.0.30` (peer)	VAD engine (AudioWorklet + Silero ONNX)	Peer dep of vad-react
`onnxruntime-web`	`^1.17.0` (peer)	ONNX runtime for Silero model	Required by vad-web
Web Audio API	browser built-in	AnalyserNode for waveform bars	Zero bundle cost; already in browser
Native `<audio>`	browser built-in	Playback of synthesized WAV	No extra library needed

Supporting

Library	Version	Purpose	When to Use
lucide-react	`^0.574.0` (already in ui/)	`Mic`, `Square`, `Loader2`, `Volume2`, `Play`, `Pause` icons	Voice button states + audio player
shadcn/ui `Badge`	already installed	Voice badge on agent messages	ChatVoiceBadge component
shadcn/ui `Collapsible`	already installed	Expand/collapse full markdown in voice_full messages	ChatVoiceBadge expand section

Alternatives Considered

Instead of	Could Use	Tradeoff
@ricky0123/vad-react	Manual silence detection with AudioWorklet	Much more complex; vad-react is the de facto standard
Canvas waveform	SVG bars	Canvas performs better for 30fps animation
Native `<audio>` + blob URL	Howler.js	No extra dependency; native handles WAV fine

Installation:

pnpm add @ricky0123/vad-react --filter @paperclipai/ui

Version verification (confirmed against npm registry 2026-04-03):

@ricky0123/vad-react: 0.0.36 (latest)
@ricky0123/vad-web: 0.0.30 (peer dependency, installed automatically)
onnxruntime-web: 1.24.3 (latest; ^1.17.0 from vad-web is satisfied)

Architecture Patterns

Recommended Project Structure

ui/src/
├── components/
│   ├── VoiceMicButton.tsx       # Replaces VoiceRecordButton — VAD + waveform + three states
│   ├── VoiceWaveform.tsx        # Canvas amplitude bars (30-50 points, 32px tall)
│   ├── VoiceModeToggle.tsx      # Three-pill: Text / Voice In / Full Voice
│   ├── ChatVoicePlayer.tsx      # Inline audio player with play/pause/progress
│   └── ChatVoiceBadge.tsx       # "Voice" badge + collapsible full markdown
├── hooks/
│   ├── useVadRecorder.ts        # Wraps useMicVAD; exposes Float32Array on speech end
│   └── useVoiceMode.ts          # Reads/writes voiceMode from nexus-settings
ui/public/
│   ├── vad.worklet.bundle.min.js   # From @ricky0123/vad-web/dist/
│   ├── silero_vad_legacy.onnx      # From @ricky0123/vad-web/dist/
│   └── silero_vad_v5.onnx          # From @ricky0123/vad-web/dist/
server/src/
└── app.ts  (add COOP/COEP headers middleware)

Pattern 1: useMicVAD from @ricky0123/vad-react

What: Hook that runs Silero VAD in an AudioWorklet; fires onSpeechEnd(audio: Float32Array) after silence
When to use: VoiceMicButton and useVadRecorder hook

// Source: https://docs.vad.ricky0123.com/user-guide/api/
import { useMicVAD } from "@ricky0123/vad-react";

const vad = useMicVAD({
  startOnLoad: false,            // user must click mic button first
  onSpeechEnd: (audio: Float32Array) => {
    // audio is Float32Array at 16kHz
    // Convert to WAV blob and POST to /api/transcribe
  },
  onSpeechStart: () => { /* update waveform active state */ },
  positiveSpeechThreshold: 0.8,
  negativeSpeechThreshold: 0.8 - 0.15,
  redemptionFrames: 8,           // silence frames tolerated before onSpeechEnd; frame length depends on the model, so tune toward the 1.5s target
  baseAssetPath: "/",            // serve from ui/public/
  onnxWASMBasePath: "/",
});

// Returned: { listening, loading, errored, userSpeaking, start, pause }
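The silence window produced by redemptionFrames depends on how long one VAD frame lasts, so it helps to compute the frame count from a target duration rather than hard-code it. The sketch below does that arithmetic. The helper names are hypothetical, and the frame sizes in the usage note (1536 samples for the legacy Silero model, 512 for v5) are assumptions that should be checked against the installed vad-web version.

```typescript
// Hypothetical helpers for converting between VAD frames and milliseconds.
// Assumption: audio is processed at 16 kHz, as stated for onSpeechEnd above.
const SAMPLE_RATE_HZ = 16000;

/** Duration of one VAD frame in milliseconds. */
function frameMs(frameSamples: number, sampleRate: number = SAMPLE_RATE_HZ): number {
  return (frameSamples * 1000) / sampleRate;
}

/** Smallest whole number of frames that covers at least `targetMs` of silence. */
function framesForSilence(
  targetMs: number,
  frameSamples: number,
  sampleRate: number = SAMPLE_RATE_HZ,
): number {
  return Math.ceil(targetMs / frameMs(frameSamples, sampleRate));
}
```

Under those assumed frame sizes, framesForSilence(1500, 1536) gives 16 and framesForSilence(1500, 512) gives 47, which suggests the 1.5s auto-stop target needs a larger redemptionFrames value than 8.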

Audio conversion for upload:

// Float32Array → WAV blob (16kHz, mono, 16-bit PCM)
function float32ToWav(samples: Float32Array, sampleRate = 16000): Blob {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  // The 44-byte header and 16-bit PCM samples are written here; see encodeWav in Pattern 6 for the full encoder
  return new Blob([buffer], { type: "audio/wav" });
}

Pattern 2: Web Audio API AnalyserNode for waveform

What: Connect MediaStream to AnalyserNode; poll getByteFrequencyData in rAF loop
When to use: VoiceWaveform component (only while recording)

// Source: MDN Web Audio API docs
const audioCtx = new AudioContext();
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 64;          // 32 frequency bins
const source = audioCtx.createMediaStreamSource(stream);
source.connect(analyser);
const dataArray = new Uint8Array(analyser.frequencyBinCount); // 32 bars

function draw() {
  animRef.current = requestAnimationFrame(draw);
  analyser.getByteFrequencyData(dataArray);
  // render bars to canvas
}
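The "render bars to canvas" step needs the 0-255 byte values reduced to a fixed number of bar heights. A minimal sketch of that reduction follows; toBarHeights is a hypothetical helper, and averaging adjacent bins is one reasonable choice rather than something 37-UI-SPEC.md prescribes.

```typescript
/**
 * Reduce AnalyserNode byte data (0-255 per bin) to `barCount` bar heights in pixels.
 * Each bar is the rounded average of a contiguous slice of bins, scaled to maxHeightPx.
 */
function toBarHeights(data: Uint8Array, barCount: number, maxHeightPx: number): number[] {
  if (barCount <= 0 || data.length === 0) return [];
  const heights: number[] = [];
  const binsPerBar = data.length / barCount;
  for (let bar = 0; bar < barCount; bar++) {
    const start = Math.floor(bar * binsPerBar);
    // Always cover at least one bin, even when there are more bars than bins.
    const end = Math.min(data.length, Math.max(start + 1, Math.floor((bar + 1) * binsPerBar)));
    let sum = 0;
    for (let i = start; i < end; i++) sum += data[i];
    const average = sum / (end - start);
    heights.push(Math.round((average / 255) * maxHeightPx));
  }
  return heights;
}
```

Inside draw(), the result can be painted with one fillRect call per bar; with fftSize = 64 the 32 bins map one-to-one onto 32 bars at the 32px height used by VoiceWaveform.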

Pattern 3: COOP/COEP headers in Express

What: Cross-origin isolation required for SharedArrayBuffer (used by AudioWorklet/ONNX)
When to use: All static responses and Vite dev server

// Source: MDN - Cross-Origin Isolation
// In server/src/app.ts, before static/vite middleware:
app.use((_req, res, next) => {
  res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
  res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
  next();
});

// For Vite dev in vite.config.ts:
server: {
  headers: {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
  },
},

Critical: COEP require-corp means every cross-origin resource must opt in via a Cross-Origin-Resource-Policy header (or CORS). CDN-hosted VAD assets are cross-origin and fall under this rule, as do externally hosted images. Serve VAD assets from ui/public/ (same-origin) to avoid CORP issues entirely.
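Because the same two headers are needed by both Express and the Vite dev server, one option is to define them once and reuse them. The sketch below assumes hypothetical helper names (isolationHeaders, crossOriginIsolation) and uses a minimal structural type instead of Express typings so it stays self-contained.

```typescript
type CoepMode = "require-corp" | "credentialless";

/** Header set that makes a page cross-origin isolated. */
function isolationHeaders(coep: CoepMode = "require-corp"): Record<string, string> {
  return {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": coep,
  };
}

// Minimal shape of the response object; an Express Response satisfies it.
interface HeaderSink {
  setHeader(name: string, value: string): unknown;
}

/** Express-style middleware factory that applies the isolation headers to every response. */
function crossOriginIsolation(coep: CoepMode = "require-corp") {
  const headers = isolationHeaders(coep);
  return (_req: unknown, res: HeaderSink, next: () => void): void => {
    for (const name of Object.keys(headers)) res.setHeader(name, headers[name]);
    next();
  };
}
```

Usage would be app.use(crossOriginIsolation()) in server/src/app.ts and server: { headers: isolationHeaders() } in vite.config.ts. In the browser, the global crossOriginIsolated flag reports whether isolation actually took effect.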

Pattern 4: VAD asset setup (Vite)

What: Copy ONNX + worklet files to public/ so they are served at root
When to use: Build setup task

# After pnpm install, copy from node_modules:
cp node_modules/@ricky0123/vad-web/dist/vad.worklet.bundle.min.js ui/public/
cp node_modules/@ricky0123/vad-web/dist/silero_vad_legacy.onnx ui/public/
cp node_modules/@ricky0123/vad-web/dist/silero_vad_v5.onnx ui/public/

Alternatively, use vite-plugin-static-copy, or add a copy script to package.json and run it from the prepare hook:

"scripts": {
  "copy-vad-assets": "cp node_modules/@ricky0123/vad-web/dist/vad.worklet.bundle.min.js public/ && cp node_modules/@ricky0123/vad-web/dist/*.onnx public/"
}

Pattern 5: useVoiceMode hook

What: Reads voiceMode from GET /api/nexus-settings, writes via PATCH
When to use: VoiceModeToggle component; ChatPanel to pass voiceMode to stream call

// Source: existing nexus-settings pattern in codebase
type VoiceMode = "text" | "voice_input" | "full_voice";

export function useVoiceMode() {
  const [mode, setMode] = useState<VoiceMode>("text");
  // Load on mount via GET /api/nexus-settings
  // PATCH on change
  return { mode, setMode: async (next: VoiceMode) => { ... } };
}
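Since the mode arrives from a settings response, it is worth narrowing the raw value before storing it in state. A minimal sketch, assuming hypothetical helpers parseVoiceMode and voiceModeFlags; falling back to "text" mirrors the hook's initial state above.

```typescript
const VOICE_MODES = ["text", "voice_input", "full_voice"] as const;
type VoiceMode = (typeof VOICE_MODES)[number];

/** Narrow an untrusted settings value to a VoiceMode, defaulting to "text". */
function parseVoiceMode(value: unknown): VoiceMode {
  const match = VOICE_MODES.find((mode) => mode === value);
  return match ?? "text";
}

/** Derive the two UI switches implied by a mode. */
function voiceModeFlags(mode: VoiceMode): { micEnabled: boolean; speakResponses: boolean } {
  return {
    micEnabled: mode !== "text",            // voice_input and full_voice both record
    speakResponses: mode === "full_voice",  // only full_voice synthesizes replies
  };
}
```

The hook could call parseVoiceMode on whatever field the settings payload exposes, and components could read voiceModeFlags(mode) instead of comparing strings in several places.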

Pattern 6: Float32Array → WAV Blob

What: Convert vad-react onSpeechEnd Float32Array (16kHz) to WAV for upload
When to use: useVadRecorder.ts, before POSTing to /api/transcribe

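// Source: standard WAV encoding algorithm (verified against multiple sources)
// Helper: write an ASCII tag (e.g. "RIFF") byte by byte at the given offset.
function writeString(view: DataView, offset: number, text: string): void {
  for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
}

function encodeWav(samples: Float32Array, sampleRate = 16000): Blob {
  const numSamples = samples.length;
  const buffer = new ArrayBuffer(44 + numSamples * 2);
  const view = new DataView(buffer);
  // RIFF chunk
  writeString(view, 0, "RIFF");
  view.setUint32(4, 36 + numSamples * 2, true);
  writeString(view, 8, "WAVE");
  // fmt sub-chunk
  writeString(view, 12, "fmt ");
  view.setUint32(16, 16, true);  // fmt chunk size (16 bytes for PCM)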
  view.setUint16(20, 1, true);   // PCM = 1
  view.setUint16(22, 1, true);   // mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true);  // byte rate
  view.setUint16(32, 2, true);   // block align
  view.setUint16(34, 16, true);  // bits per sample
  // data sub-chunk
  writeString(view, 36, "data");
  view.setUint32(40, numSamples * 2, true);
  let offset = 44;
  for (let i = 0; i < numSamples; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
    offset += 2;
  }
  return new Blob([buffer], { type: "audio/wav" });
}
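For the planned encodeWav unit test, a small header reader makes the assertions readable. The sketch below is a hypothetical test helper (readWavHeader) that only understands the canonical 44-byte PCM layout written above; it is not a general WAV parser.

```typescript
interface WavHeader {
  audioFormat: number;
  channels: number;
  sampleRate: number;
  byteRate: number;
  blockAlign: number;
  bitsPerSample: number;
  dataBytes: number;
}

/** Parse the canonical 44-byte PCM WAV header; throws if the chunk tags are not where expected. */
function readWavHeader(buffer: ArrayBuffer): WavHeader {
  if (buffer.byteLength < 44) throw new Error("WAV header requires at least 44 bytes");
  const view = new DataView(buffer);
  const tag = (offset: number): string =>
    String.fromCharCode(
      view.getUint8(offset),
      view.getUint8(offset + 1),
      view.getUint8(offset + 2),
      view.getUint8(offset + 3),
    );
  if (tag(0) !== "RIFF" || tag(8) !== "WAVE" || tag(12) !== "fmt " || tag(36) !== "data") {
    throw new Error("Not a canonical 44-byte PCM WAV header");
  }
  return {
    audioFormat: view.getUint16(20, true),
    channels: view.getUint16(22, true),
    sampleRate: view.getUint32(24, true),
    byteRate: view.getUint32(28, true),
    blockAlign: view.getUint16(32, true),
    bitsPerSample: view.getUint16(34, true),
    dataBytes: view.getUint32(40, true),
  };
}
```

A test could read the encoder's output into an ArrayBuffer and check that sampleRate is 16000, channels is 1, bitsPerSample is 16, and dataBytes equals twice the sample count.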

Pattern 7: POST /api/synthesize + playback

What: Send text to synthesis endpoint, receive WAV buffer, play with native audio
When to use: ChatVoicePlayer when messageType is voice_full

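async function playVoiceResponse(text: string, autoPlay: boolean): Promise<HTMLAudioElement> {
  const res = await fetch("/api/synthesize", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    credentials: "include",
    body: JSON.stringify({ text }),
  });
  if (!res.ok) throw new Error(`Synthesis failed: ${res.status}`);
  const url = URL.createObjectURL(await res.blob());
  const audio = new Audio(url);
  // Revoke the blob URL once playback ends or fails, so it does not leak.
  const release = () => URL.revokeObjectURL(url);
  audio.addEventListener("ended", release, { once: true });
  audio.addEventListener("error", release, { once: true });
  // play() returns a promise that rejects when autoplay is blocked; fall back to manual controls.
  if (autoPlay) audio.play().catch(() => {});
  return audio; // caller wires the pause/play controls
}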

Anti-Patterns to Avoid

Calling useMicVAD with startOnLoad: true: Triggers immediate mic permission prompt on page load, not on user gesture. Always use startOnLoad: false and call vad.start() on mic button click.
Using AudioContext before user gesture: Browsers require AudioContext creation/resume inside a user interaction. Create it lazily in the click handler, not on component mount.
Serving VAD assets from CDN with COEP require-corp: CDN responses that lack a CORP or CORS opt-in are blocked and surface as COEP fetch errors. Copy the assets to ui/public/ and use baseAssetPath: "/".
Not revoking blob URLs: URL.createObjectURL() leaks memory if URLs are not revoked after use.
POSTing Float32Array directly to /api/transcribe: The transcribe endpoint expects audio/webm or audio/wav multipart upload. Must encode Float32Array to WAV first.

Don't Hand-Roll

Problem	Don't Build	Use Instead	Why
Silence detection	Custom silence timer with AudioWorklet	@ricky0123/vad-react	Silero VAD model; handles background noise, breath, plosives; 37 published versions
WAV encoding	Custom encoder	Short standard WAV encoder with a 44-byte header (see Pattern 6)	Not complex enough for a library; standard algorithm
Audio playback	Custom audio element abstraction	Native `<audio>` + URL.createObjectURL()	Browser handles all codec/format negotiation
ONNX inference	Build ONNX runner	onnxruntime-web (peer dep of vad-web)	Already bundled

Common Pitfalls

Pitfall 1: COEP blocks CDN asset loading

What goes wrong: After adding Cross-Origin-Embedder-Policy: require-corp, all cross-origin resources (Google Fonts, avatars from external URLs, CDN assets) are blocked unless they send Cross-Origin-Resource-Policy: cross-origin. Existing chat images from /api/assets/ (same-origin) are fine, but any externally hosted content breaks.
Why it happens: COEP require-corp enforces CORP on all sub-resources.
How to avoid: Serve all VAD ONNX/worklet assets from ui/public/ (same-origin). Audit for any cross-origin resource loads in existing chat components before adding headers.
Warning signs: Console errors: "COEP blocked cross-origin resource" for non-audio assets.
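The audit step can be partly mechanized by checking each resource URL against the page origin. The sketch below uses hypothetical helpers (isCrossOrigin, findCrossOriginUrls) and treats data: and blob: URLs as local, which is a simplifying assumption for the purpose of the audit.

```typescript
/** True when `resourceUrl`, resolved against the page, points at a different origin. */
function isCrossOrigin(resourceUrl: string, pageOrigin: string): boolean {
  const resolved = new URL(resourceUrl, pageOrigin);
  // Assumption: inline data and blob URLs are produced by the page itself.
  if (resolved.protocol === "data:" || resolved.protocol === "blob:") return false;
  return resolved.origin !== new URL(pageOrigin).origin;
}

/** Filter a list of resource URLs down to the ones COEP would scrutinize. */
function findCrossOriginUrls(urls: string[], pageOrigin: string): string[] {
  return urls.filter((url) => isCrossOrigin(url, pageOrigin));
}
```

In a browser console, the input list could come from document.images or from performance.getEntriesByType("resource"), with location.origin as the page origin.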

Pitfall 2: AudioContext suspended due to autoplay policy

What goes wrong: AudioContext.state === "suspended" prevents AnalyserNode from producing data; waveform is all zeros.
Why it happens: Browsers require AudioContext to be created or resumed inside a user gesture (click/tap).
How to avoid: Create new AudioContext() lazily inside the mic button click handler. If the context exists but is suspended, call context.resume() before starting recording.
Warning signs: Waveform canvas renders but all bars are flat (zero amplitude).
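The create-lazily-then-resume rule fits in one small helper that the click handler can call. In this sketch ensureRunning is a hypothetical name, and the context is typed structurally so the helper can be unit tested without a real AudioContext.

```typescript
// Minimal shape shared by AudioContext and test doubles.
interface ResumableContext {
  readonly state: string;
  resume(): Promise<void>;
}

/**
 * Return a context that is ready to produce data.
 * Creates one on first use and resumes it when the browser has it suspended.
 * Must be called from inside a user gesture handler.
 */
async function ensureRunning<T extends ResumableContext>(
  existing: T | null,
  create: () => T,
): Promise<T> {
  const ctx = existing ?? create();
  if (ctx.state === "suspended") await ctx.resume();
  return ctx;
}
```

In the mic button's click handler this could read ctxRef.current = await ensureRunning(ctxRef.current, () => new AudioContext()), where ctxRef is an assumed React ref holding the context between renders.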

Pitfall 3: VAD model files not found

What goes wrong: useMicVAD throws or hangs with loading: true indefinitely; console shows 404 for .onnx or .worklet.bundle.min.js.
Why it happens: Default baseAssetPath may point to CDN; if COEP is active, CDN fetch is blocked. Or files were not copied to ui/public/.
How to avoid: Explicitly set baseAssetPath: "/" and onnxWASMBasePath: "/" in useMicVAD options. Verify files exist at ui/public/vad.worklet.bundle.min.js, ui/public/silero_vad_legacy.onnx, ui/public/silero_vad_v5.onnx after install. Because onnxWASMBasePath is also "/", the onnxruntime-web .wasm files presumably need to be served from the same root; confirm this against the vad-web docs.
Warning signs: vad.loading === true for more than 3 seconds; 404s in network tab.

Pitfall 4: voiceMode not passed through useStreamingChat

What goes wrong: When a voice_input message is sent, the server doesn't set messageType: "voice_input" on the stored message, so ChatVoiceBadge never renders.
Why it happens: The current signature of useStreamingChat.startStream() is (userMessage: string, agentId?: string), with no voiceMode parameter. chat.ts only sets messageType when voiceMode is in the request body.
How to avoid: Extend useStreamingChat.startStream() to accept voiceMode?: string and pass it in the fetch body to /api/conversations/${id}/stream.
Warning signs: Voice messages render as plain user text without the voice badge.
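One way to keep the extended startStream() backwards compatible is to build the request body in a helper that only adds voiceMode when it matters. Everything in this sketch is hypothetical: the helper name, the message and agentId field names, and the choice to omit voiceMode in text mode are assumptions, since the actual payload shape of the stream route is not shown here.

```typescript
type VoiceMode = "text" | "voice_input" | "full_voice";

// Assumed payload shape; adjust field names to match the real stream route.
interface StreamRequestBody {
  message: string;
  agentId?: string;
  voiceMode?: VoiceMode;
}

/** Build the stream request body, adding voiceMode only for voice modes. */
function buildStreamBody(
  userMessage: string,
  agentId?: string,
  voiceMode?: VoiceMode,
): StreamRequestBody {
  const body: StreamRequestBody = { message: userMessage };
  if (agentId) body.agentId = agentId;
  // Omitting the field in text mode keeps existing callers' payloads unchanged.
  if (voiceMode && voiceMode !== "text") body.voiceMode = voiceMode;
  return body;
}
```

startStream() would then pass JSON.stringify(buildStreamBody(userMessage, agentId, voiceMode)) as the fetch body.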

Pitfall 5: onSpeechEnd fires on very short utterances

What goes wrong: Background noise triggers onSpeechEnd with very short audio that produces garbage transcription.
Why it happens: VAD fires even for brief sounds if positiveSpeechThreshold is too low.
How to avoid: Set minSpeechFrames: 3 (minimum ~180ms) and positiveSpeechThreshold: 0.8 to filter noise. Display a "Too short" toast if the returned text is empty or < 2 chars.
Warning signs: Empty transcriptions appearing in chat; fast repeated submissions.
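The two checks described above (clip length and transcript length) are easy to express as pure functions. This is a minimal sketch with hypothetical names; the 2-character threshold comes from the text above, and the 16 kHz default matches the onSpeechEnd audio format.

```typescript
/** Length of a mono clip in milliseconds. */
function utteranceMs(audio: Float32Array, sampleRate: number = 16000): number {
  return (audio.length * 1000) / sampleRate;
}

/** True when a transcript is long enough to submit (at least 2 non-whitespace-trimmed chars). */
function isUsableTranscript(text: string | null | undefined): boolean {
  return (text ?? "").trim().length >= 2;
}
```

onSpeechEnd could skip the upload when utteranceMs(audio) falls below a chosen minimum, and show the "Too short" toast when isUsableTranscript(text) is false.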

Pitfall 6: Phase 36 deliverables not present in working branch

What goes wrong: Building ChatInput voice integration before server/src/routes/voice.ts, server/src/services/nexus-settings.ts voiceMode schema, and voiceMode in createMessageSchema are present causes compile errors and missing endpoints.
Why it happens: Only Phase 36 Task 1 (VoicePipelineService) is on the current branch.
How to avoid: Wave 0 must cherry-pick or re-implement Phase 36 Tasks 2-3 commits before any Phase 37 implementation work. Verify that POST /api/transcribe and POST /api/synthesize respond successfully to a valid request before proceeding.

Code Examples

VoiceMicButton state machine

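// Source: 37-UI-SPEC.md + useMicVAD API docs
import { useState } from "react";
import { useMicVAD } from "@ricky0123/vad-react";
import { Loader2, Mic } from "lucide-react";
import { Button } from "@/components/ui/button";      // assumed shadcn path
import { VoiceWaveform } from "./VoiceWaveform";
import { encodeWav } from "@/lib/encodeWav";          // Pattern 6 encoder; path assumed

type RecordState = "idle" | "recording" | "processing";

function VoiceMicButton({ onTranscript }: { onTranscript: (text: string) => void }) {
  const [state, setState] = useState<RecordState>("idle");
  const vad = useMicVAD({
    startOnLoad: false,
    baseAssetPath: "/",
    onnxWASMBasePath: "/",
    onSpeechEnd: async (audio: Float32Array) => {
      vad.pause();                // stop listening while the clip is transcribed
      setState("processing");
      try {
        const form = new FormData();
        form.append("audio", encodeWav(audio), "recording.wav");
        const res = await fetch("/api/transcribe", {
          method: "POST", credentials: "include", body: form,
        });
        if (!res.ok) throw new Error(`Transcription failed: ${res.status}`);
        const { text } = await res.json() as { text: string };
        if (text?.trim()) onTranscript(text.trim());
      } catch (err) {
        console.error(err);       // surface the failure; user-facing feedback is left to the caller
      } finally {
        setState("idle");         // never leave the button stuck in "processing"
      }
    },
  });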

  const handleClick = () => {
    if (state === "idle") { vad.start(); setState("recording"); }
    else if (state === "recording") { vad.pause(); setState("idle"); }
  };

  if (state === "processing") return <Button disabled><Loader2 className="h-4 w-4 animate-spin" /></Button>;
  if (state === "recording") return (
    <Button className="ring-2 ring-primary" onClick={handleClick} aria-label="Recording — speak now">
      <VoiceWaveform listening={vad.listening} />
    </Button>
  );
  return <Button onClick={handleClick} aria-label="Start voice input"><Mic className="h-4 w-4" /></Button>;
}

ChatVoiceBadge (voice_full expand/collapse)

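// Source: 37-UI-SPEC.md; uses shadcn Collapsible (already installed)
// MarkdownBody is assumed to be the codebase's existing markdown renderer; its import is omitted here.
import { useState } from "react";
import { Collapsible, CollapsibleContent, CollapsibleTrigger } from "@/components/ui/collapsible";
import { Badge } from "@/components/ui/badge";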

function ChatVoiceBadge({ content, messageType }: { content: string; messageType: string }) {
  const [open, setOpen] = useState(false);
  const spokenMatch = content.match(/SPOKEN:\s*([\s\S]*?)(?=\nDETAILED:|$)/);
  const spokenText = spokenMatch?.[1]?.trim() ?? content;
  const detailedMatch = content.match(/DETAILED:\s*([\s\S]*)/);

  return (
    <div>
      <Badge variant="outline" className="text-xs mb-2">Voice</Badge>
      <p className="text-sm">{spokenText}</p>
      {messageType === "voice_full" && detailedMatch && (
        <Collapsible open={open} onOpenChange={setOpen}>
          <CollapsibleTrigger className="text-xs text-muted-foreground hover:text-foreground mt-1">
            {open ? "Hide full response" : "Show full response"}
          </CollapsibleTrigger>
          <CollapsibleContent>
            <MarkdownBody className="text-sm mt-2">{detailedMatch[1].trim()}</MarkdownBody>
          </CollapsibleContent>
        </Collapsible>
      )}
    </div>
  );
}
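The SPOKEN/DETAILED parsing in the component can be pulled out into a pure function so it is testable without rendering. The sketch below reuses the same two regular expressions; splitVoiceContent is a hypothetical name.

```typescript
interface VoiceContent {
  spoken: string;
  detailed: string | null;
}

/** Split a voice message into its spoken summary and optional detailed markdown. */
function splitVoiceContent(content: string): VoiceContent {
  const spokenMatch = content.match(/SPOKEN:\s*([\s\S]*?)(?=\nDETAILED:|$)/);
  const detailedMatch = content.match(/DETAILED:\s*([\s\S]*)/);
  return {
    // Fall back to the whole message when no SPOKEN: marker is present.
    spoken: spokenMatch?.[1]?.trim() ?? content,
    detailed: detailedMatch?.[1]?.trim() ?? null,
  };
}
```

ChatVoiceBadge could call splitVoiceContent(content) once and render the Collapsible only when detailed is not null.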

State of the Art

Old Approach	Current Approach	When Changed	Impact
WebRTC VAD polyfill	Silero VAD via ONNX + AudioWorklet	2023-2024	Dramatically better accuracy; handles noisy environments
MediaRecorder → manual silence timer	@ricky0123/vad-react onSpeechEnd	2023	Eliminates timer tuning; model-based accuracy
Flash/plugin audio playback	Native `<audio>` + Web Audio API	2015+	Universal; no plugin required
Custom waveform libraries	Web Audio API AnalyserNode	Always	Zero dependency; 30fps canvas

Deprecated/outdated:

annyang, artyom.js: Web Speech API wrappers — browser-only, privacy concerns, no offline support
Manual silence detection with onaudioprocess: Deprecated ScriptProcessor API; replaced by AudioWorklet
MediaRecorder direct upload (VoiceRecordButton v1): Manual stop only; no auto-silence — replaced by useVadRecorder

Open Questions

1. autoPlay persistence: nexus-settings vs localStorage
- What we know: nexus-settings already has voiceMode field. autoPlay (WCHAT-06) is a separate user preference.
- What's unclear: Should autoPlay live in nexus-settings (persisted server-side, works across devices) or localStorage (client-only, simpler)?
- Recommendation: Use localStorage key nexus:voice:autoplay for autoPlay. It is a per-device UX preference that doesn't need server-side persistence, and this keeps nexus-settings lean.
2. COEP impact on existing cross-origin resources
- What we know: COEP require-corp blocks cross-origin resources without CORP header.
- What's unclear: Do existing Chat UI components load any cross-origin images (avatar CDN, external URLs in messages)?
- Recommendation: Audit ui/src/components/ChatMessage.tsx and Identity.tsx for external image src. If any exist, use credentialless instead of require-corp for COEP. This relaxes the restriction while still enabling SharedArrayBuffer in Chromium 96+. MEDIUM confidence: Firefox may not support credentialless mode.
3. VAD false-positive rate in quiet environments
- What we know: Silero VAD default thresholds are tuned for speech.
- What's unclear: In near-silent environments, keyboard noise or mouse clicks may trigger onSpeechEnd.
- Recommendation: Use minSpeechFrames: 5 (300ms minimum), stricter than the 3 suggested in Pitfall 5, as a safety gate. Show "Too short, try again" toast if transcript is empty.
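If the localStorage recommendation for autoPlay is adopted, the preference can be read and written through a tiny wrapper. The sketch below assumes the nexus:voice:autoplay key named above; the helper names and the default of false are assumptions, and the store is typed structurally so tests can pass an in-memory stand-in for window.localStorage.

```typescript
const AUTOPLAY_KEY = "nexus:voice:autoplay";

// Subset of the Web Storage API used here; window.localStorage satisfies it.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

/** Read the autoPlay preference, using `fallback` when nothing valid is stored. */
function readAutoPlay(store: KeyValueStore, fallback: boolean = false): boolean {
  const raw = store.getItem(AUTOPLAY_KEY);
  if (raw === "true") return true;
  if (raw === "false") return false;
  return fallback;
}

/** Persist the autoPlay preference as the string "true" or "false". */
function writeAutoPlay(store: KeyValueStore, enabled: boolean): void {
  store.setItem(AUTOPLAY_KEY, String(enabled));
}
```

ChatVoicePlayer could call readAutoPlay(window.localStorage) on mount, and the settings toggle could call writeAutoPlay on change.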

Environment Availability

Dependency	Required By	Available	Version	Fallback
Node.js	build + tests	✓	v20.20.2	—
pnpm	package install	✓	9.15.4	—
@ricky0123/vad-react	WCHAT-02	✗ (not installed)	—	Must install via pnpm
@ricky0123/vad-web	peer of vad-react	✗ (not installed)	—	Installed automatically
onnxruntime-web	peer of vad-web	✗ (not installed)	—	Installed automatically
Phase 36 Task 2-3 deliverables	All voice routes	✗ (not on branch)	—	Wave 0 must cherry-pick or re-implement

Missing dependencies with no fallback:

@ricky0123/vad-react — must be installed (pnpm add @ricky0123/vad-react --filter @paperclipai/ui)
Phase 36 server-side deliverables — POST /api/transcribe, POST /api/synthesize, nexus-settings voiceMode

Missing dependencies with fallback:

None

Validation Architecture

Test Framework

Property	Value
Framework	vitest ^3.0.5
Config file	`ui/vitest.config.ts`
Quick run command	`pnpm --filter @paperclipai/ui test --run`
Full suite command	`pnpm test --run`

Phase Requirements → Test Map

Req ID	Behavior	Test Type	Automated Command	File Exists?
WCHAT-01	VoiceMicButton renders idle/recording/processing states	unit	`pnpm --filter @paperclipai/ui test --run -- VoiceMicButton`	❌ Wave 0
WCHAT-02	useVadRecorder calls onTranscript after onSpeechEnd fires	unit	`pnpm --filter @paperclipai/ui test --run -- useVadRecorder`	❌ Wave 0
WCHAT-03	VoiceWaveform renders canvas with correct dimensions	unit	`pnpm --filter @paperclipai/ui test --run -- VoiceWaveform`	❌ Wave 0
WCHAT-04	ChatVoicePlayer renders play button; auto-plays when autoPlay=true	unit	`pnpm --filter @paperclipai/ui test --run -- ChatVoicePlayer`	❌ Wave 0
WCHAT-05	VoiceModeToggle renders three pills; click updates mode	unit	`pnpm --filter @paperclipai/ui test --run -- VoiceModeToggle`	❌ Wave 0
WCHAT-06	useVoiceMode persists mode to nexus-settings; loads on mount	unit	`pnpm --filter @paperclipai/ui test --run -- useVoiceMode`	❌ Wave 0
WCHAT-01,02	POST /api/transcribe returns { text } for WAV upload	unit (server)	`pnpm --filter @paperclipai/server test --run -- voice-routes`	❌ (Phase 36 Task 3 — verify present)
WCHAT-04	POST /api/synthesize returns audio/wav for text input	unit (server)	`pnpm --filter @paperclipai/server test --run -- voice-routes`	❌ (Phase 36 Task 3 — verify present)
WCHAT-02	encodeWav produces valid 44-byte WAV header	unit	`pnpm --filter @paperclipai/ui test --run -- encodeWav`	❌ Wave 0

Note: UI tests use // @vitest-environment jsdom at the top of test files (see ChatInput.test.tsx pattern). All voice component tests must include this directive.

Sampling Rate

Per task commit: pnpm --filter @paperclipai/ui test --run
Per wave merge: pnpm test --run
Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

ui/src/components/VoiceMicButton.test.tsx — covers WCHAT-01
ui/src/hooks/useVadRecorder.test.ts — covers WCHAT-02
ui/src/components/VoiceWaveform.test.tsx — covers WCHAT-03
ui/src/components/ChatVoicePlayer.test.tsx — covers WCHAT-04
ui/src/components/VoiceModeToggle.test.tsx — covers WCHAT-05
ui/src/hooks/useVoiceMode.test.ts — covers WCHAT-06
ui/src/lib/encodeWav.test.ts — covers WAV encoding utility
Verify server/src/routes/voice.ts present (Phase 36 Task 3)
Verify server/src/services/nexus-settings.ts has voiceMode (Phase 36 Task 2)

Sources

Primary (HIGH confidence)

npm registry — @ricky0123/vad-react@0.0.36, @ricky0123/vad-web@0.0.30, onnxruntime-web@1.24.3 versions verified 2026-04-03
https://docs.vad.ricky0123.com/user-guide/api/ — useMicVAD API properties (listening, loading, errored, userSpeaking, start, pause, onSpeechEnd, baseAssetPath, onnxWASMBasePath)
MDN Web Audio API AnalyserNode documentation — waveform pattern
37-UI-SPEC.md (committed in a0103337) — component inventory, interaction states, copywriting contract
37-CONTEXT.md (committed in 30708d38) — implementation decisions

Secondary (MEDIUM confidence)

Git history analysis (fd372eaf, d0d7a23a, 11508547) — Phase 36 deliverable status
https://web.dev/articles/coop-coep — COOP/COEP header semantics
Vite docs — server.headers for dev server COOP/COEP

Tertiary (LOW confidence)

COEP credentialless alternative (open question #2) — browser support needs verification

Metadata

Confidence breakdown:

Standard stack: HIGH — npm registry confirmed versions; vad-react API verified from official docs
Architecture: HIGH — derived from 37-UI-SPEC.md (committed) + existing codebase patterns
Pitfalls: HIGH, based on verified browser behaviour (autoplay policy, COEP); LOW for pitfall #5 (threshold tuning is empirical)
Branch status: HIGH — verified via git log --all --oneline + git show of specific commits

Research date: 2026-04-03
Valid until: 2026-05-03 (stable APIs; vad-react hasn't released a major version since 2023)
