---
phase: 39-voice-polish
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - server/src/services/voice-pipeline.ts
  - server/src/routes/voice.ts
  - ui/src/components/ChatVoicePlayer.tsx
  - server/src/__tests__/39-sentence-streaming.test.ts
autonomous: true
requirements: [VPIPE-07, VPIPE-08]
must_haves:
  truths:
    - "First sentence audio begins playing before full response finishes synthesizing"
    - "User can request same text synthesized in multiple languages simultaneously"
    - "Existing single-language synthesize endpoint still works unchanged"
  artifacts:
    - path: "server/src/services/voice-pipeline.ts"
      provides: "synthesizeSentenceStream generator + synthesizeMultiLang method"
      contains: "synthesizeSentenceStream"
    - path: "server/src/routes/voice.ts"
      provides: "POST /api/synthesize/stream SSE endpoint + POST /api/synthesize/multi-lang"
      contains: "synthesize/stream"
    - path: "ui/src/components/ChatVoicePlayer.tsx"
      provides: "Streaming audio playback via sentence-buffered fetch"
      contains: "EventSource\\|ReadableStream\\|sentence"
    - path: "server/src/__tests__/39-sentence-streaming.test.ts"
      provides: "Tests for sentence splitting, multi-lang, and streaming"
      contains: "describe.*sentence"
  key_links:
    - from: "ui/src/components/ChatVoicePlayer.tsx"
      to: "/api/synthesize/stream"
      via: "fetch POST with ReadableStream"
      pattern: "synthesize/stream"
    - from: "server/src/routes/voice.ts"
      to: "server/src/services/voice-pipeline.ts"
      via: "synthesizeSentenceStream generator"
      pattern: "synthesizeSentenceStream"
---

Sentence-buffered TTS streaming and multi-language synthesis.

Purpose: Voice responses begin playing before full synthesis completes (under 1s to first audio), and users can synthesize the same response in multiple languages without a second agent call.

Output: Streaming synthesize endpoint, multi-language endpoint, updated ChatVoicePlayer with progressive playback.
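The sentence-buffered idea in the purpose statement can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the `ABBREVIATIONS` list and the injected `synthesizeOne` callback are hypothetical stand-ins (the real method is expected to take a `voiceId` and call piper), and `Uint8Array` is used in place of `Buffer` to keep the block self-contained.

```typescript
// Hypothetical abbreviation list (an assumption); extend it as the tests require.
const ABBREVIATIONS = new Set(["dr.", "mr.", "mrs.", "ms.", "prof."]);

// Split on whitespace that follows ., ! or ?, then re-join fragments that
// end in a known abbreviation so "Dr. Smith" stays in one sentence.
function splitSentences(text: string): string[] {
  const fragments = text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  const sentences: string[] = [];
  let pending = "";
  for (const fragment of fragments) {
    const candidate = pending ? `${pending} ${fragment}` : fragment;
    const lastWord = candidate.split(/\s+/).pop() ?? "";
    if (ABBREVIATIONS.has(lastWord.toLowerCase())) {
      pending = candidate; // the sentence continues in the next fragment
    } else {
      sentences.push(candidate);
      pending = "";
    }
  }
  if (pending) sentences.push(pending); // text ended on an abbreviation
  return sentences;
}

// Stand-in for the per-sentence piper call (hypothetical signature).
type SynthesizeOne = (sentence: string) => Promise<Uint8Array>;

// Yield each sentence's audio as soon as it is ready instead of
// concatenating every buffer before returning.
async function* synthesizeSentenceStream(
  text: string,
  synthesizeOne: SynthesizeOne,
): AsyncGenerator<{ index: number; total: number; audio: Uint8Array }> {
  const sentences = splitSentences(text);
  for (const [index, sentence] of sentences.entries()) {
    const audio = await synthesizeOne(sentence);
    yield { index, total: sentences.length, audio };
  }
}
```

A consumer would iterate with `for await (const chunk of synthesizeSentenceStream(text, synthesizeOne))` and forward each chunk as it arrives, so the first sentence can start playing while later ones are still being synthesized.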
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md

@.planning/ROADMAP.md
@.planning/REQUIREMENTS.md
@.planning/phases/39-voice-polish/39-CONTEXT.md
@server/src/services/voice-pipeline.ts
@server/src/routes/voice.ts
@ui/src/components/ChatVoicePlayer.tsx

From server/src/services/voice-pipeline.ts:

```typescript
// synthesize already does sentence splitting internally:
//   const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
// Currently concatenates all sentence buffers before returning.
// Need to yield each sentence buffer as it completes.
export function voicePipelineService(): {
  transcribe(buffer: Buffer, format: "webm" | "ogg" | "wav"): Promise<{ text: string; language?: string }>;
  synthesize(text: string, voiceId?: string): Promise<Buffer>;
  formatForVoice(text: string): string;
  transcodeToWav16k(inputBuffer: Buffer, inputFormat: string): Promise<Buffer>;
}
```

From server/src/routes/voice.ts:

```typescript
// POST /api/synthesize: takes { text, voiceId }, returns audio/wav buffer
// POST /api/transcribe: takes multipart audio, returns { text, language? }
```

From ui/src/components/ChatVoicePlayer.tsx:

```typescript
interface ChatVoicePlayerProps {
  text: string;
  autoPlay?: boolean;
}
// Currently fetches full audio blob from POST /api/synthesize, then plays
```

Task 1: Sentence-buffered synthesis + multi-language TTS in voice pipeline and routes

server/src/services/voice-pipeline.ts, server/src/routes/voice.ts, server/src/__tests__/39-sentence-streaming.test.ts

- server/src/services/voice-pipeline.ts (full file: understand synthesize internals)
- server/src/routes/voice.ts (full file: understand route patterns)
- server/src/routes/authz.ts (assertBoard pattern)

- Test: splitSentences("Hello world. How are you? I am fine.") returns ["Hello world.", "How are you?", "I am fine."]
- Test: splitSentences("Dr. Smith went to D.C. He liked it.") returns ["Dr. Smith went to D.C.", "He liked it."] (abbreviation-aware)
- Test: synthesizeSentenceStream yields Buffer chunks, one per sentence
- Test: synthesizeMultiLang(text, ["en_US-lessac-medium", "da_DK-talesyntese-medium"]) returns a Map with two Buffer entries

1. In voice-pipeline.ts, extract sentence splitting into an exported `splitSentences(text: string): string[]` function. Start from the current regex, /(?<=[.!?])\s+/, and filter empty strings. The bare regex also splits after "Dr.", so the abbreviation-aware test above needs an extra step, for example re-joining fragments that end in a known title abbreviation. Keep existing synthesize() working by calling splitSentences internally.
2. Add an `async *synthesizeSentenceStream(text: string, voiceId?: string): AsyncGenerator<{ index: number; total: number; audio: Buffer }>` method:
   - Call splitSentences(text) to get the sentences array
   - For each sentence, call piper (same as the current synthesize logic) and immediately yield { index, total: sentences.length, audio: audioBuffer }
   - This gives the consumer each sentence's audio as soon as it is ready
3. Add an `async synthesizeMultiLang(text: string, voiceIds: string[]): Promise<Map<string, Buffer>>` method:
   - For each voiceId, call existing synthesize(text, voiceId) in parallel via Promise.all
   - Return a Map keyed by voiceId
4. Update the return signature of voicePipelineService() to include the new methods.
5. In voice.ts, add the streaming endpoint `POST /api/synthesize/stream`, which accepts { text: string, voiceId?: string }:
   - assertBoard(req)
   - Set headers: Content-Type: text/event-stream, Cache-Control: no-cache, Connection: keep-alive
   - Iterate synthesizeSentenceStream; for each chunk, write SSE `data: { "index": N, "total": M, "audio": "<base64>" }\n\n`
   - On completion: write `data: { "done": true }\n\n`, then res.end()
6. In voice.ts, add the multi-language endpoint `POST /api/synthesize/multi-lang`, which accepts { text: string, voiceIds: string[] }:
   - assertBoard(req)
   - Validate that voiceIds is an array with 1-5 entries
   - Call synthesizeMultiLang, return JSON: { results: [{ voiceId, audio: base64 }] }
7. Write tests in 39-sentence-streaming.test.ts:
   - Test splitSentences with basic and edge cases
   - Test synthesizeSentenceStream yields the correct number of chunks (mock piper execFile)
   - Test synthesizeMultiLang returns the correct number of entries (mock piper)

cd /opt/nexus && npx vitest run server/src/__tests__/39-sentence-streaming.test.ts --reporter=verbose 2>&1 | tail -30

- grep -q "splitSentences" server/src/services/voice-pipeline.ts
- grep -q "synthesizeSentenceStream" server/src/services/voice-pipeline.ts
- grep -q "synthesizeMultiLang" server/src/services/voice-pipeline.ts
- grep -q "synthesize/stream" server/src/routes/voice.ts
- grep -q "synthesize/multi-lang" server/src/routes/voice.ts
- grep -q "text/event-stream" server/src/routes/voice.ts
- test -f server/src/__tests__/39-sentence-streaming.test.ts

- splitSentences exported and tested
- synthesizeSentenceStream yields per-sentence audio chunks via AsyncGenerator
- synthesizeMultiLang synthesizes the same text in N languages in parallel
- POST /api/synthesize/stream sends SSE with base64 audio per sentence
- POST /api/synthesize/multi-lang returns an array of { voiceId, audio } pairs
- Existing POST /api/synthesize unchanged (backward compatible)
- All tests pass

Task 2: ChatVoicePlayer sentence-buffered streaming playback

ui/src/components/ChatVoicePlayer.tsx

- ui/src/components/ChatVoicePlayer.tsx (full file: current playback implementation)
- ui/src/components/ChatMessage.tsx (how ChatVoicePlayer is used)

1. Refactor ChatVoicePlayer to support a streaming playback mode:
   - Add a `streaming` prop (default true) to ChatVoicePlayerProps
   - When streaming=true, call POST /api/synthesize/stream using fetch with { method: "POST", body, headers } and read response.body as a ReadableStream, parsing SSE lines manually. EventSource cannot be used here because it only supports GET.
2. Streaming playback logic:
   - Maintain a queue of audio chunks (base64-decoded from SSE data)
   - On the first chunk received: decode base64 to an ArrayBuffer, create a Blob with type audio/wav, create an object URL, set it as the audio src, and begin playback immediately. This satisfies the "under 1 second" requirement.
   - On subsequent chunks: queue them. When the current audio fires `onEnded`, take the next chunk from the front of the queue, set it as the new src, and play
   - Show progress: "Playing 1/3..." in the UI
3. Sentence progress indicator:
   - Display "Sentence N of M" text when streaming is active
   - Show a small progress bar or dot indicator below the play button
4. Fallback: when streaming=false, or if the SSE connection fails, fall back to the existing full-fetch behavior (current implementation)
5. Clean up: revoke all object URLs on unmount or when new text arrives
6. Keep the existing play/pause controls working for both modes

cd /opt/nexus && npx tsc --noEmit --project ui/tsconfig.json 2>&1 | tail -20

- grep -q "synthesize/stream" ui/src/components/ChatVoicePlayer.tsx
- grep -q "ReadableStream\|getReader\|TextDecoder" ui/src/components/ChatVoicePlayer.tsx
- grep -q "queue\|Queue\|audioQueue" ui/src/components/ChatVoicePlayer.tsx
- grep -q "Sentence.*of\|sentence.*progress\|Playing.*of" ui/src/components/ChatVoicePlayer.tsx

- ChatVoicePlayer connects to /api/synthesize/stream via fetch POST + ReadableStream
- First sentence audio begins playing as soon as the first SSE chunk arrives
- Subsequent sentences auto-play in sequence from the queue
- Progress indicator shows the current sentence position
- Falls back to full-fetch on stream error
- TypeScript compiles without errors

1. TypeScript compiles: `npx tsc --noEmit` in both server and ui
2. Tests pass: `npx vitest run server/src/__tests__/39-sentence-streaming.test.ts`
3. Existing synthesize endpoint still works: grep confirms original POST /api/synthesize route unchanged
4. SSE endpoint exists: grep confirms text/event-stream header in voice.ts
5. Multi-lang endpoint exists: grep confirms synthesize/multi-lang in voice.ts

- VPIPE-07: First sentence plays while subsequent sentences are still synthesizing (sentence-buffered SSE streaming)
- VPIPE-08: Single text can be synthesized in multiple languages via /api/synthesize/multi-lang
- Backward compatible: existing /api/synthesize POST unchanged
- All tests green

After completion, create `.planning/phases/39-voice-polish/39-01-SUMMARY.md`
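
For the manual SSE parsing that Task 2 calls for, one possible shape is sketched below. It is an illustration under assumptions rather than the component's actual code: `parseSseBuffer`, `readSentenceStream`, and the event types are hypothetical names, and the parser assumes the server ends each event with `\n\n` as described in Task 1.

```typescript
// Event shapes as described for POST /api/synthesize/stream in Task 1.
interface SentenceEvent { index: number; total: number; audio: string }
interface DoneEvent { done: true }
type StreamEvent = SentenceEvent | DoneEvent;

// Split accumulated text into complete SSE frames and parse their data
// lines. The trailing partial frame (if any) is returned as `rest` so the
// caller can prepend it to the next network chunk.
function parseSseBuffer(buffer: string): { events: StreamEvent[]; rest: string } {
  const frames = buffer.split("\n\n");
  const rest = frames.pop() ?? "";
  const events: StreamEvent[] = [];
  for (const frame of frames) {
    const data = frame
      .split("\n")
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice("data:".length).trim())
      .join("\n");
    if (data) events.push(JSON.parse(data) as StreamEvent);
  }
  return { events, rest };
}

// Read a fetch response body incrementally and hand each parsed event to
// the caller, which would queue the audio or finish playback on "done".
async function readSentenceStream(
  body: ReadableStream<Uint8Array>,
  onEvent: (event: StreamEvent) => void,
): Promise<void> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });
    const parsed = parseSseBuffer(buffer);
    buffer = parsed.rest;
    parsed.events.forEach(onEvent);
  }
}
```

Keeping the frame parser as a pure function separate from the reader loop makes it easy to unit test with partial chunks, which is where hand-rolled SSE parsing tends to break.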