nexus/.planning/phases/37-web-chat-voice-ui/37-02-PLAN.md

---
phase: 37-web-chat-voice-ui
plan: 02
type: execute
wave: 2
depends_on:
  - 37-01
files_modified:
  - ui/src/lib/encodeWav.ts
  - ui/src/hooks/useVadRecorder.ts
  - ui/src/hooks/useVoiceMode.ts
  - ui/src/components/VoiceWaveform.tsx
  - ui/src/components/VoiceMicButton.tsx
autonomous: true
requirements:
  - WCHAT-01
  - WCHAT-02
  - WCHAT-03
  - WCHAT-05
must_haves:
  truths:
    - "VoiceMicButton renders three visual states: idle (Mic icon), recording (waveform + ring), processing (Loader2 spinner)"
    - "Recording auto-stops on silence via VAD onSpeechEnd callback"
    - "VoiceWaveform renders animated canvas bars during recording"
    - "useVadRecorder converts Float32Array to WAV and POSTs to /api/transcribe"
    - "useVoiceMode reads voiceMode from GET /api/nexus/settings and writes via PATCH"
  artifacts:
    - path: ui/src/lib/encodeWav.ts
      provides: "Float32Array to WAV blob encoder"
      exports: [encodeWav]
    - path: ui/src/hooks/useVadRecorder.ts
      provides: "VAD recording hook with auto-stop"
      exports: [useVadRecorder]
    - path: ui/src/hooks/useVoiceMode.ts
      provides: "Voice mode state from nexus-settings"
      exports: [useVoiceMode]
    - path: ui/src/components/VoiceWaveform.tsx
      provides: "Canvas amplitude visualization"
      exports: [VoiceWaveform]
    - path: ui/src/components/VoiceMicButton.tsx
      provides: "VAD-powered mic button with three states"
      exports: [VoiceMicButton]
  key_links:
    - from: ui/src/components/VoiceMicButton.tsx
      to: ui/src/hooks/useVadRecorder.ts
      via: "useVadRecorder() hook call"
      pattern: "useVadRecorder"
    - from: ui/src/hooks/useVadRecorder.ts
      to: ui/src/lib/encodeWav.ts
      via: "encodeWav(audio) in onSpeechEnd"
      pattern: "encodeWav"
    - from: ui/src/hooks/useVadRecorder.ts
      to: /api/transcribe
      via: "fetch POST with FormData"
      pattern: "fetch.*api/transcribe"
    - from: ui/src/components/VoiceMicButton.tsx
      to: ui/src/components/VoiceWaveform.tsx
      via: "VoiceWaveform rendered inside recording state"
      pattern: "<VoiceWaveform"
---

Build the core voice recording components: WAV encoder, VAD recorder hook, voice mode hook, waveform visualization, and the VoiceMicButton that ties them together.

Purpose: These are the foundational building blocks that replace VoiceRecordButton with VAD-powered auto-stop recording and real-time waveform visualization.

Output: 5 new files — encodeWav utility, useVadRecorder hook, useVoiceMode hook, VoiceWaveform component, VoiceMicButton component

<execution_context> @$HOME/.claude/get-shit-done/workflows/execute-plan.md @$HOME/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/phases/37-web-chat-voice-ui/37-RESEARCH.md
@.planning/phases/37-web-chat-voice-ui/37-01-SUMMARY.md

```typescript
// @ricky0123/vad-react useMicVAD hook
import { useMicVAD } from "@ricky0123/vad-react";

const vad = useMicVAD({
  startOnLoad: false,
  baseAssetPath: "/",
  onnxWASMBasePath: "/",
  positiveSpeechThreshold: 0.8,
  negativeSpeechThreshold: 0.65,
  redemptionFrames: 8,
  minSpeechFrames: 5,
  onSpeechStart: () => {},                  // signature: () => void
  onSpeechEnd: (audio: Float32Array) => {}, // signature: (audio: Float32Array) => void
});
// Returns: { listening, loading, errored, userSpeaking, start, pause }
```

interface VoiceRecordButtonProps {
  onTranscription: (text: string) => void;
  disabled?: boolean;
}
GET  /api/nexus/settings → { mode, voiceEnabled, voiceMode, ... }
PATCH /api/nexus/settings → accepts partial, returns updated
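
The settings endpoints above could be wrapped roughly as follows. This is a minimal sketch, not the project's actual code: `FetchLike`, `parseVoiceMode`, `loadVoiceMode`, and `saveVoiceMode` are hypothetical names, and the response is assumed to be a JSON object carrying a `voiceMode` field. The fetch function is injected so the logic can be exercised without a browser.

```typescript
// Sketch only: helper names are hypothetical; endpoint and field names come from the plan.
export type VoiceMode = "text" | "voice_input" | "full_voice";

const VOICE_MODES: readonly VoiceMode[] = ["text", "voice_input", "full_voice"];
const SETTINGS_URL = "/api/nexus/settings";

// Minimal fetch signature so a stub can be injected in tests.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string; credentials?: "include" },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Narrow an untyped settings payload to a VoiceMode, falling back to "text".
export function parseVoiceMode(payload: unknown): VoiceMode {
  const value = (payload as { voiceMode?: unknown } | null)?.voiceMode;
  return VOICE_MODES.includes(value as VoiceMode) ? (value as VoiceMode) : "text";
}

// GET the settings and extract the voice mode.
export async function loadVoiceMode(fetchImpl: FetchLike): Promise<VoiceMode> {
  const res = await fetchImpl(SETTINGS_URL, { credentials: "include" });
  if (!res.ok) throw new Error(`GET settings failed: ${res.status}`);
  return parseVoiceMode(await res.json());
}

// Optimistic update: apply `next` immediately, revert to `previous` if the PATCH fails.
export async function saveVoiceMode(
  fetchImpl: FetchLike,
  previous: VoiceMode,
  next: VoiceMode,
  apply: (mode: VoiceMode) => void,
): Promise<void> {
  apply(next);
  try {
    const res = await fetchImpl(SETTINGS_URL, {
      method: "PATCH",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ voiceMode: next }),
      credentials: "include",
    });
    if (!res.ok) throw new Error(`PATCH settings failed: ${res.status}`);
  } catch {
    apply(previous);
  }
}
```

Inside a React hook, `apply` would typically be the `useState` setter and `fetchImpl` a thin wrapper such as `(url, init) => fetch(url, init)`.
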
Task 1: Create encodeWav utility and useVadRecorder + useVoiceMode hooks

Files: ui/src/lib/encodeWav.ts, ui/src/hooks/useVadRecorder.ts, ui/src/hooks/useVoiceMode.ts

Read first: ui/src/hooks/useStreamingChat.ts, ui/src/api/chat.ts, ui/src/components/VoiceRecordButton.tsx

  1. **ui/src/lib/encodeWav.ts** (WAV encoder function):

    ```typescript
    export function encodeWav(samples: Float32Array, sampleRate = 16000): Blob
    ```

    • Standard 44-byte WAV header (RIFF/WAVE/fmt/data chunks)
    • PCM format (1), mono (1 channel), 16-bit depth
    • Clamp samples to [-1, 1] range before int16 conversion
    • Return Blob with type "audio/wav"
    • Helper: `function writeString(view: DataView, offset: number, str: string)`
  2. ui/src/hooks/useVadRecorder.ts (VAD recording hook):

    interface UseVadRecorderOptions {
      onTranscript: (text: string) => void;
    }
    interface UseVadRecorderReturn {
      state: "idle" | "recording" | "processing";
      start: () => void;
      stop: () => void;
      mediaStream: MediaStream | null; // exposed for VoiceWaveform AnalyserNode
    }
    export function useVadRecorder(opts: UseVadRecorderOptions): UseVadRecorderReturn
    

    Implementation:

    • Use useMicVAD from @ricky0123/vad-react with startOnLoad: false
    • Set baseAssetPath: "/" and onnxWASMBasePath: "/" (serve from ui/public/)
    • Set positiveSpeechThreshold: 0.8, minSpeechFrames: 5 (300ms minimum to filter noise)
    • In onSpeechEnd(audio: Float32Array):
        a. Call vad.pause() to stop listening
        b. Set state to "processing"
        c. Call encodeWav(audio) to get WAV blob
        d. Create FormData, append blob as "audio" field with filename "recording.wav"
        e. POST to /api/transcribe with credentials: "include"
        f. Parse response as { text: string }
        g. If text is non-empty (length >= 2), call opts.onTranscript(text.trim())
        h. Set state back to "idle"
    • start(): calls vad.start(), sets state to "recording"
    • stop(): calls vad.pause(), sets state to "idle"
    • Expose mediaStream from navigator.mediaDevices.getUserMedia({ audio: true }) — store in a ref. This is needed for VoiceWaveform AnalyserNode.
    • NOTE: useMicVAD manages its own media stream internally, but VoiceWaveform needs a separate reference to the stream for the AnalyserNode. Request the stream in the start() function and store in a ref. Stop tracks in stop().
  3. ui/src/hooks/useVoiceMode.ts (voice mode hook):

    type VoiceMode = "text" | "voice_input" | "full_voice";
    interface UseVoiceModeReturn {
      mode: VoiceMode;
      setMode: (next: VoiceMode) => Promise<void>;
      isLoading: boolean;
    }
    export function useVoiceMode(): UseVoiceModeReturn
    

    Implementation:

    • On mount, GET /api/nexus/settings with credentials: "include"
    • Extract voiceMode from response, default to "text"
    • setMode(next): optimistically update local state, then PATCH /api/nexus/settings with { voiceMode: next }
    • Use useState for mode and isLoading
    • Wrap fetch in try/catch; on error, revert to previous mode

Verify:

    cd /opt/nexus/.claude/worktrees/agent-a009558f &&
    test -f ui/src/lib/encodeWav.ts &&
    test -f ui/src/hooks/useVadRecorder.ts &&
    test -f ui/src/hooks/useVoiceMode.ts &&
    grep -q "encodeWav" ui/src/lib/encodeWav.ts &&
    grep -q "useVadRecorder" ui/src/hooks/useVadRecorder.ts &&
    grep -q "useVoiceMode" ui/src/hooks/useVoiceMode.ts &&
    grep -q "useMicVAD" ui/src/hooks/useVadRecorder.ts &&
    grep -q "api/transcribe" ui/src/hooks/useVadRecorder.ts &&
    grep -q "api/nexus/settings" ui/src/hooks/useVoiceMode.ts &&
    echo "PASS" || echo "FAIL"

<acceptance_criteria>
    • grep "export function encodeWav" ui/src/lib/encodeWav.ts returns match
    • grep "export function useVadRecorder" ui/src/hooks/useVadRecorder.ts returns match
    • grep "export function useVoiceMode" ui/src/hooks/useVoiceMode.ts returns match
    • grep "useMicVAD" ui/src/hooks/useVadRecorder.ts returns match
    • grep "startOnLoad.*false" ui/src/hooks/useVadRecorder.ts returns match
    • grep "baseAssetPath" ui/src/hooks/useVadRecorder.ts returns match with "/"
    • grep "api/transcribe" ui/src/hooks/useVadRecorder.ts returns match
    • grep "api/nexus/settings" ui/src/hooks/useVoiceMode.ts returns match
    • grep "encodeWav" ui/src/hooks/useVadRecorder.ts returns match (imports it)
    • grep "RIFF" ui/src/lib/encodeWav.ts returns match (WAV header)

</acceptance_criteria>

Done: encodeWav utility produces valid WAV blobs. useVadRecorder wraps useMicVAD with auto-stop + transcription. useVoiceMode reads/writes voiceMode from nexus-settings API.

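
The encodeWav requirements in Task 1 (44-byte header, PCM mono 16-bit, clamping before int16 conversion) can be sketched as below. This is one possible implementation under those stated requirements, not a finished file: `encodeWavBuffer` is a hypothetical helper introduced here so the byte layout can be checked without reading a Blob asynchronously.

```typescript
// Sketch only: encodeWav and writeString follow the plan's signatures;
// encodeWavBuffer is a hypothetical helper added for testability.
function writeString(view: DataView, offset: number, str: string): void {
  for (let i = 0; i < str.length; i++) {
    view.setUint8(offset + i, str.charCodeAt(i));
  }
}

export function encodeWavBuffer(samples: Float32Array, sampleRate = 16000): ArrayBuffer {
  const numChannels = 1; // mono
  const bytesPerSample = 2; // 16-bit PCM
  const blockAlign = numChannels * bytesPerSample;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  writeString(view, 0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the 8-byte RIFF preamble
  writeString(view, 8, "WAVE");
  writeString(view, 12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true); // byte rate
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, 8 * bytesPerSample, true); // bits per sample
  writeString(view, 36, "data");
  view.setUint32(40, dataSize, true);

  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] so out-of-range floats cannot wrap around in int16.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Asymmetric scaling maps -1 to -32768 and +1 to 32767.
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

export function encodeWav(samples: Float32Array, sampleRate = 16000): Blob {
  return new Blob([encodeWavBuffer(samples, sampleRate)], { type: "audio/wav" });
}
```

All multi-byte fields are written little-endian (the `true` argument), which is what the RIFF container expects.
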
Task 2: Create VoiceWaveform canvas component and VoiceMicButton

Files: ui/src/components/VoiceWaveform.tsx, ui/src/components/VoiceMicButton.tsx

Read first: ui/src/hooks/useVadRecorder.ts, ui/src/lib/encodeWav.ts, ui/src/components/VoiceRecordButton.tsx

  1. **ui/src/components/VoiceWaveform.tsx** (canvas-based amplitude visualization):

    ```typescript
    interface VoiceWaveformProps {
      stream: MediaStream | null;
      active: boolean; // controls animation loop
    }
    export function VoiceWaveform({ stream, active }: VoiceWaveformProps)
    ```

    Implementation:

    • Use a `<canvas>` element, width=80, height=32 (h-8 per UI spec), className="inline-block"
    • On mount (when stream is truthy and active is true):
        a. Create AudioContext (lazily; only create once, store in ref)
        b. If AudioContext is suspended, call `audioCtx.resume()`
        c. Create MediaStreamSource from stream
        d. Create AnalyserNode with fftSize=128 (gives 64 frequency bins; fftSize=64 yields only 32 bins, so dataArray[i*2] would run out of range from bar 16 onward)
        e. Connect source -> analyser
        f. Start requestAnimationFrame loop:
            - Call `analyser.getByteFrequencyData(dataArray)` into Uint8Array(64)
            - Clear canvas
            - Draw 20 bars (skip every other bin for cleaner look): each bar width=2px, gap=2px
            - Bar height = (dataArray[i*2] / 255) * canvasHeight, minimum 2px
            - Bar color: use CSS variable --primary via getComputedStyle
        g. Store animationFrame id in ref for cleanup
    • On cleanup or when active becomes false: cancelAnimationFrame, disconnect source
    • Do NOT close AudioContext on cleanup (reuse across start/stop cycles)
  2. ui/src/components/VoiceMicButton.tsx (VAD-powered mic button):
    interface VoiceMicButtonProps {
      onTranscript: (text: string) => void;
      disabled?: boolean;
    }
    export function VoiceMicButton({ onTranscript, disabled }: VoiceMicButtonProps)
    
    Implementation:
    • Call useVadRecorder({ onTranscript }) to get { state, start, stop, mediaStream }
    • Three visual states per UI spec:
        a. idle (state === "idle"): Render Button with ghost variant, size="icon", h-8 w-8. Content: <Mic className="h-4 w-4" />. aria-label="Start voice input". onClick calls start().
        b. recording (state === "recording"): Render Button with ghost variant, size="icon", h-8 w-8, with ring-2 ring-primary classes. Content: <VoiceWaveform stream={mediaStream} active={true} />. aria-label="Recording — speak now". onClick calls stop().
        c. processing (state === "processing"): Render Button disabled, ghost variant, size="icon", h-8 w-8. Content: <Loader2 className="h-4 w-4 animate-spin" />. aria-label="Transcribing...".
    • Import Mic, Loader2 from lucide-react
    • Import Button from @/components/ui/button
    • Import VoiceWaveform from ./VoiceWaveform
    • Import useVadRecorder from ../hooks/useVadRecorder
    • When disabled prop is true, render idle state with disabled attribute

Verify:

    cd /opt/nexus/.claude/worktrees/agent-a009558f &&
    test -f ui/src/components/VoiceWaveform.tsx &&
    test -f ui/src/components/VoiceMicButton.tsx &&
    grep -q "VoiceWaveform" ui/src/components/VoiceWaveform.tsx &&
    grep -q "VoiceMicButton" ui/src/components/VoiceMicButton.tsx &&
    grep -q "canvas" ui/src/components/VoiceWaveform.tsx &&
    grep -q "useVadRecorder" ui/src/components/VoiceMicButton.tsx &&
    grep -q "Mic" ui/src/components/VoiceMicButton.tsx &&
    grep -q "Loader2" ui/src/components/VoiceMicButton.tsx &&
    grep -q "ring-2 ring-primary" ui/src/components/VoiceMicButton.tsx &&
    echo "PASS" || echo "FAIL"

<acceptance_criteria>
    • grep "export function VoiceWaveform" ui/src/components/VoiceWaveform.tsx returns match
    • grep "export function VoiceMicButton" ui/src/components/VoiceMicButton.tsx returns match
    • grep "canvas" ui/src/components/VoiceWaveform.tsx returns match
    • grep "AnalyserNode|createAnalyser|analyser" ui/src/components/VoiceWaveform.tsx returns match
    • grep "requestAnimationFrame" ui/src/components/VoiceWaveform.tsx returns match
    • grep "getByteFrequencyData" ui/src/components/VoiceWaveform.tsx returns match
    • grep "useVadRecorder" ui/src/components/VoiceMicButton.tsx returns match
    • grep 'aria-label="Start voice input"' ui/src/components/VoiceMicButton.tsx returns match
    • grep 'aria-label="Recording' ui/src/components/VoiceMicButton.tsx returns match
    • grep 'aria-label="Transcribing' ui/src/components/VoiceMicButton.tsx returns match
    • grep "ring-2 ring-primary" ui/src/components/VoiceMicButton.tsx returns match
    • grep "Loader2.*animate-spin" ui/src/components/VoiceMicButton.tsx returns match

</acceptance_criteria>

Done: VoiceWaveform renders 20 animated bars from Web Audio API AnalyserNode on an 80x32 canvas. VoiceMicButton shows idle/recording/processing states with correct icons, aria-labels, and ring styling.

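
The bar geometry used by the waveform's animation loop can be isolated into pure functions, which keeps the drawing math testable without a canvas. The sketch below is illustrative only: `computeBarHeights`, `layoutBars`, and `BarRect` are hypothetical names, the bin index is clamped so the lookup stays in range for any analyser bin count, and bottom alignment of the bars is an assumption since the plan does not specify it.

```typescript
// Sketch only: function and type names are hypothetical helpers.
export interface BarRect {
  x: number;
  y: number;
  w: number;
  h: number;
}

// Map frequency-bin amplitudes (0..255) to bar heights in pixels.
export function computeBarHeights(
  data: Uint8Array,
  barCount: number,
  canvasHeight: number,
  minHeight = 2,
): number[] {
  const heights: number[] = [];
  for (let i = 0; i < barCount; i++) {
    // Read every other bin, clamping so the index never passes the last bin.
    const bin = Math.min(i * 2, data.length - 1);
    const amplitude = data.length > 0 ? data[bin] / 255 : 0;
    heights.push(Math.max(minHeight, amplitude * canvasHeight));
  }
  return heights;
}

// Turn heights into rectangles: fixed bar width and gap, bars anchored to the bottom edge.
// Bottom alignment is an assumption; a centered layout would use (canvasHeight - h) / 2.
export function layoutBars(
  heights: number[],
  canvasHeight: number,
  barWidth = 2,
  gap = 2,
): BarRect[] {
  return heights.map((h, i) => ({
    x: i * (barWidth + gap),
    y: canvasHeight - h,
    w: barWidth,
    h,
  }));
}
```

In the requestAnimationFrame callback, the component would fill `data` via `analyser.getByteFrequencyData(data)`, clear the canvas, and call `ctx.fillRect(r.x, r.y, r.w, r.h)` for each rectangle. With 20 bars at 2px width and 2px gap, the last bar ends at x=78, which fits the 80px canvas.
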
- All 5 files exist and export their named functions
- useVadRecorder uses useMicVAD with startOnLoad: false and baseAssetPath: "/"
- VoiceMicButton has three distinct visual states with correct aria-labels
- VoiceWaveform uses canvas + AnalyserNode pattern
- encodeWav produces Blob with type audio/wav
- useVoiceMode reads/writes via /api/nexus/settings

<success_criteria> Core voice recording pipeline complete: user clicks mic -> VAD listens -> waveform animates -> silence detected -> audio encoded to WAV -> POSTed to /api/transcribe -> transcript returned. Voice mode readable/writable from nexus-settings. </success_criteria>
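
The upload leg of that pipeline (WAV blob to /api/transcribe to transcript) might look like the sketch below. It is a hedged illustration rather than the hook's final code: `PostForm`, `normalizeTranscript`, and `transcribeWav` are hypothetical names, the response is assumed to be JSON of the form `{ text: string }`, and the two-character minimum is applied after trimming, which is slightly stricter than checking the raw length.

```typescript
// Sketch only: helper names are hypothetical; endpoint, field name, and filename come from the plan.
type PostForm = (
  url: string,
  init: { method: "POST"; body: FormData; credentials: "include" },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Trim the transcript and drop results shorter than two characters.
export function normalizeTranscript(payload: unknown): string | null {
  const text = (payload as { text?: unknown } | null)?.text;
  if (typeof text !== "string") return null;
  const trimmed = text.trim();
  return trimmed.length >= 2 ? trimmed : null;
}

// Send the encoded WAV as multipart form data and return the usable transcript, if any.
export async function transcribeWav(wav: Blob, post: PostForm): Promise<string | null> {
  const form = new FormData();
  form.append("audio", wav, "recording.wav");
  const res = await post("/api/transcribe", {
    method: "POST",
    body: form,
    credentials: "include",
  });
  if (!res.ok) throw new Error(`transcribe failed: ${res.status}`);
  return normalizeTranscript(await res.json());
}
```

A caller in onSpeechEnd could pass `(url, init) => fetch(url, init)` as `post`, invoke `onTranscript` only when the result is non-null, and reset state to "idle" in a `finally` block so a failed request does not leave the button stuck in "processing".
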

After completion, create `.planning/phases/37-web-chat-voice-ui/37-02-SUMMARY.md`