---
phase: 39-voice-polish
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- server/src/services/voice-pipeline.ts
- server/src/routes/voice.ts
- ui/src/components/ChatVoicePlayer.tsx
- server/src/__tests__/39-sentence-streaming.test.ts
autonomous: true
requirements: [VPIPE-07, VPIPE-08]
must_haves:
truths:
- "First sentence audio begins playing before full response finishes synthesizing"
- "User can request same text synthesized in multiple languages simultaneously"
- "Existing single-language synthesize endpoint still works unchanged"
artifacts:
- path: "server/src/services/voice-pipeline.ts"
provides: "synthesizeSentenceStream generator + synthesizeMultiLang method"
contains: "synthesizeSentenceStream"
- path: "server/src/routes/voice.ts"
provides: "POST /api/synthesize/stream SSE endpoint + POST /api/synthesize/multi-lang"
contains: "synthesize/stream"
- path: "ui/src/components/ChatVoicePlayer.tsx"
provides: "Streaming audio playback via sentence-buffered fetch"
contains: "EventSource\\|ReadableStream\\|sentence"
- path: "server/src/__tests__/39-sentence-streaming.test.ts"
provides: "Tests for sentence splitting, multi-lang, and streaming"
contains: "describe.*sentence"
key_links:
- from: "ui/src/components/ChatVoicePlayer.tsx"
to: "/api/synthesize/stream"
via: "fetch POST with ReadableStream (SSE parsed manually; EventSource only supports GET)"
pattern: "synthesize/stream"
- from: "server/src/routes/voice.ts"
to: "server/src/services/voice-pipeline.ts"
via: "synthesizeSentenceStream generator"
pattern: "synthesizeSentenceStream"
---
Sentence-buffered TTS streaming and multi-language synthesis.
Purpose: Voice responses begin playing before full synthesis completes (under 1s to first audio), and users can synthesize the same response in multiple languages without a second agent call.
Output: Streaming synthesize endpoint, multi-language endpoint, updated ChatVoicePlayer with progressive playback.
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
@.planning/ROADMAP.md
@.planning/REQUIREMENTS.md
@.planning/phases/39-voice-polish/39-CONTEXT.md
@server/src/services/voice-pipeline.ts
@server/src/routes/voice.ts
@ui/src/components/ChatVoicePlayer.tsx
From server/src/services/voice-pipeline.ts:
```typescript
// synthesize already does sentence splitting internally:
// const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
// Currently concatenates all sentence buffers before returning.
// Need to yield each sentence buffer as it completes.
export function voicePipelineService(): {
transcribe(buffer: Buffer, format: "webm" | "ogg" | "wav"): Promise<{ text: string; language?: string }>;
synthesize(text: string, voiceId?: string): Promise<Buffer>;
formatForVoice(text: string): string;
transcodeToWav16k(inputBuffer: Buffer, inputFormat: string): Promise<Buffer>;
}
```
From server/src/routes/voice.ts:
```typescript
// POST /api/synthesize — takes { text, voiceId }, returns audio/wav buffer
// POST /api/transcribe — takes multipart audio, returns { text, language? }
```
From ui/src/components/ChatVoicePlayer.tsx:
```typescript
interface ChatVoicePlayerProps {
text: string;
autoPlay?: boolean;
}
// Currently fetches full audio blob from POST /api/synthesize, then plays
```
Task 1: Sentence-buffered synthesis + multi-language TTS in voice pipeline and routes
server/src/services/voice-pipeline.ts, server/src/routes/voice.ts, server/src/__tests__/39-sentence-streaming.test.ts
- server/src/services/voice-pipeline.ts (full file — understand synthesize internals)
- server/src/routes/voice.ts (full file — understand route patterns)
- server/src/routes/authz.ts (assertBoard pattern)
- Test: splitSentences("Hello world. How are you? I am fine.") returns ["Hello world.", "How are you?", "I am fine."]
- Test: splitSentences("Dr. Smith went to D.C. He liked it.") returns ["Dr. Smith went to D.C.", "He liked it."] (abbreviation-aware)
- Test: synthesizeSentenceStream yields Buffer chunks one per sentence
- Test: synthesizeMultiLang(text, ["en_US-lessac-medium", "da_DK-talesyntese-medium"]) returns a Map with two Buffer entries, one per voice ID
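The abbreviation-aware case above cannot be met by the lookbehind regex alone, because that regex also splits after "Dr.". A minimal sketch of one way to satisfy both splitSentences tests, assuming a small hypothetical `TITLE_ABBREVIATIONS` list (not part of the existing code) whose fragments are merged into the fragment that follows:

```typescript
// Hypothetical abbreviation list; the real set is a design decision for this task.
const TITLE_ABBREVIATIONS = new Set(["Dr.", "Mr.", "Mrs.", "Ms.", "Prof."]);

function splitSentences(text: string): string[] {
  // Same split as the current synthesize(): break on whitespace after . ! or ?
  const fragments = text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);

  const sentences: string[] = [];
  let carry = "";
  for (const fragment of fragments) {
    const candidate = carry ? `${carry} ${fragment}` : fragment;
    const lastWord = candidate.split(/\s+/).pop() ?? "";
    if (TITLE_ABBREVIATIONS.has(lastWord)) {
      // Fragment ends in a title abbreviation, so it is not a sentence end yet.
      carry = candidate;
    } else {
      sentences.push(candidate);
      carry = "";
    }
  }
  if (carry) sentences.push(carry);
  return sentences;
}
```

"D.C." needs no special handling here because it contains no internal whitespace, so the regex only splits after its final period. Merged fragments are rejoined with a single space, which may differ from the original whitespace.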
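1. In voice-pipeline.ts, extract sentence splitting into an exported `splitSentences(text: string): string[]` function. Start from the current regex, /(?<=[.!?])\s+/, and filter empty strings. Note that the regex alone also splits after "Dr.", so fragments ending in a known title abbreviation must be merged with the following fragment for the abbreviation-aware test to pass. Keep existing synthesize() working by calling splitSentences internally.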
2. Add `async *synthesizeSentenceStream(text: string, voiceId?: string): AsyncGenerator<{ index: number; total: number; audio: Buffer }>` method:
- Call splitSentences(text) to get sentences array
- For each sentence, call piper (same as current synthesize logic), yield { index, total: sentences.length, audio: audioBuffer } immediately
- This gives the consumer each sentence's audio as soon as it is ready
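3. Add `async synthesizeMultiLang(text: string, voiceIds: string[]): Promise<Map<string, Buffer>>` method that synthesizes the same text once per voice ID, in parallel, and returns the audio keyed by voice ID.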
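The two pipeline methods described in the steps above could be sketched as follows. This is an illustration under stated assumptions, not the existing implementation: `SynthesizeOne` is a hypothetical stand-in for the current per-sentence piper call, the generator takes pre-split sentences rather than raw text, and `Uint8Array` is used in place of `Buffer` (which extends it) so the sketch has no Node-specific types.

```typescript
// Hypothetical stand-in for the existing piper invocation.
type SynthesizeOne = (text: string, voiceId?: string) => Promise<Uint8Array>;

interface SentenceChunk {
  index: number;
  total: number;
  audio: Uint8Array;
}

// Yields each sentence's audio as soon as that sentence has been synthesized,
// instead of concatenating all buffers before returning.
async function* synthesizeSentenceStream(
  sentences: string[],
  synthesizeOne: SynthesizeOne,
  voiceId?: string,
): AsyncGenerator<SentenceChunk> {
  let index = 0;
  for (const sentence of sentences) {
    const audio = await synthesizeOne(sentence, voiceId);
    yield { index, total: sentences.length, audio };
    index += 1;
  }
}

// Synthesizes the same text once per voice ID, running all voices in parallel.
async function synthesizeMultiLang(
  text: string,
  voiceIds: string[],
  synthesizeFull: SynthesizeOne,
): Promise<Map<string, Uint8Array>> {
  const entries = await Promise.all(
    voiceIds.map(async (voiceId) => [voiceId, await synthesizeFull(text, voiceId)] as const),
  );
  return new Map(entries);
}
```

Because `Promise.all` rejects on the first failure, a single failing voice fails the whole call in this sketch; whether partial results should be returned instead is left open by the plan.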
cd /opt/nexus && npx vitest run server/src/__tests__/39-sentence-streaming.test.ts --reporter=verbose 2>&1 | tail -30
- grep -q "splitSentences" server/src/services/voice-pipeline.ts
- grep -q "synthesizeSentenceStream" server/src/services/voice-pipeline.ts
- grep -q "synthesizeMultiLang" server/src/services/voice-pipeline.ts
- grep -q "synthesize/stream" server/src/routes/voice.ts
- grep -q "synthesize/multi-lang" server/src/routes/voice.ts
- grep -q "text/event-stream" server/src/routes/voice.ts
- test -f server/src/__tests__/39-sentence-streaming.test.ts
- splitSentences exported and tested
- synthesizeSentenceStream yields per-sentence audio chunks via AsyncGenerator
- synthesizeMultiLang synthesizes same text in N languages in parallel
- POST /api/synthesize/stream sends SSE with base64 audio per sentence
- POST /api/synthesize/multi-lang returns array of { voiceId, audio } pairs
- Existing POST /api/synthesize unchanged (backward compatible)
- All tests pass
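The wire format of the SSE events is not spelled out above. A minimal sketch, assuming one `data:` line per sentence carrying JSON with `index`, `total`, and base64 `audio` fields; the field names, the `SSE_HEADERS` constant, and the `formatSseEvent` helper are all hypothetical:

```typescript
// Response headers commonly sent for a Server-Sent Events stream.
const SSE_HEADERS = {
  "Content-Type": "text/event-stream",
  "Cache-Control": "no-cache",
  Connection: "keep-alive",
};

// Assumed payload shape; `audio` is the sentence's WAV data, base64-encoded.
interface SentenceEvent {
  index: number;
  total: number;
  audio: string;
}

// Frames one sentence as a single SSE event: a "data:" line followed by a blank line.
function formatSseEvent(event: SentenceEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}
```

In the route, each chunk yielded by synthesizeSentenceStream would be written with something like `res.write(formatSseEvent({ index, total, audio: audio.toString("base64") }))`, with `res.end()` after the last chunk.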
Task 2: ChatVoicePlayer sentence-buffered streaming playback
ui/src/components/ChatVoicePlayer.tsx
- ui/src/components/ChatVoicePlayer.tsx (full file — current playback implementation)
- ui/src/components/ChatMessage.tsx (how ChatVoicePlayer is used)
1. Refactor ChatVoicePlayer to support streaming playback mode:
- Add a `streaming` prop (default true) to ChatVoicePlayerProps
- When streaming=true, call POST /api/synthesize/stream with fetch({ method: "POST", body, headers }) and read response.body as a ReadableStream, parsing the SSE lines manually. EventSource cannot be used here because it only supports GET.
2. Streaming playback logic:
- Maintain a FIFO queue of decoded audio chunks (base64-decoded from the SSE data)
- On first chunk received: decode base64 to an ArrayBuffer, create a Blob with type audio/wav, create an object URL, set it as the audio src, and begin playback immediately. Starting on the first sentence rather than the full response is what targets the "under 1 second" requirement
- On subsequent chunks: queue them. When the current audio fires its `ended` event (the `onEnded` handler in React), take the next chunk from the front of the queue, set it as the new src, and play
- Show progress: "Playing 1/3..." in the UI
3. Sentence progress indicator:
- Display "Sentence N of M" text when streaming is active
- Show a small progress bar or dot indicator below the play button
4. Fallback: when streaming=false or if SSE connection fails, fall back to existing full-fetch behavior (current implementation)
5. Clean up: revoke all object URLs on unmount or when new text arrives
6. Keep the existing play/pause controls working for both modes
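The manual SSE parsing and base64 decoding from steps 1 and 2 could look roughly like this. It is a sketch that assumes the server terminates lines with "\n" (SSE also permits "\r\n") and sends one JSON payload per event; `extractSseData`, `readSseStream`, and `base64ToBytes` are hypothetical helper names.

```typescript
// Splits accumulated stream text into complete SSE events plus a trailing remainder.
function extractSseData(buffered: string): { events: string[]; rest: string } {
  const blocks = buffered.split("\n\n");
  const rest = blocks.pop() ?? ""; // the last block may still be incomplete
  const events = blocks
    .map((block) =>
      block
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, ""))
        .join("\n"),
    )
    .filter((data) => data.length > 0);
  return { events, rest };
}

// Decodes a base64 string into raw bytes.
function base64ToBytes(base64: string): Uint8Array {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Reads a fetch response body and calls onData once per complete SSE event.
async function readSseStream(
  body: ReadableStream<Uint8Array>,
  onData: (data: string) => void,
): Promise<void> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffered = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffered += decoder.decode(value, { stream: true });
    const { events, rest } = extractSseData(buffered);
    buffered = rest;
    events.forEach(onData);
  }
}
```

In the component, each decoded chunk would be wrapped with `new Blob([bytes], { type: "audio/wav" })` and `URL.createObjectURL(blob)`, and every URL created this way would be passed to `URL.revokeObjectURL` during cleanup.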
cd /opt/nexus && npx tsc --noEmit --project ui/tsconfig.json 2>&1 | tail -20
- grep -q "synthesize/stream" ui/src/components/ChatVoicePlayer.tsx
- grep -q "ReadableStream\|getReader\|TextDecoder" ui/src/components/ChatVoicePlayer.tsx
- grep -q "queue\|Queue\|audioQueue" ui/src/components/ChatVoicePlayer.tsx
- grep -q "Sentence.*of\|sentence.*progress\|Playing.*of" ui/src/components/ChatVoicePlayer.tsx
- ChatVoicePlayer connects to /api/synthesize/stream via fetch POST + ReadableStream
- First sentence audio begins playing as soon as first SSE chunk arrives
- Subsequent sentences auto-play in sequence from queue
- Progress indicator shows current sentence position
- Falls back to full-fetch on stream error
- TypeScript compiles without errors
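The in-order auto-play behavior listed above can be isolated from React and the audio element. A minimal sketch, assuming a hypothetical `SentenceQueue` class whose `play` callback sets the audio src and starts playback, and whose `onEnded` method is wired to the audio element's `ended` event:

```typescript
// Plays queued items strictly in arrival order, one at a time.
class SentenceQueue<T> {
  private pending: T[] = [];
  private playing = false;
  private played = 0;
  private readonly play: (item: T, position: number) => void;

  constructor(play: (item: T, position: number) => void) {
    this.play = play;
  }

  // Called for every chunk received; starts playback at once if nothing is playing.
  enqueue(item: T): void {
    this.pending.push(item);
    if (!this.playing) this.playNext();
  }

  // Called when the current item finishes playing.
  onEnded(): void {
    this.playing = false;
    this.playNext();
  }

  private playNext(): void {
    const next = this.pending.shift(); // front of the queue (FIFO)
    if (next === undefined) return;
    this.playing = true;
    this.played += 1;
    this.play(next, this.played); // position is 1-based, for "Sentence N of M"
  }
}
```

The first `enqueue` call triggers playback immediately, which matches the requirement that the first sentence starts as soon as its chunk arrives; later chunks wait until `onEnded` is called.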
1. TypeScript compiles: `npx tsc --noEmit` in both server and ui
2. Tests pass: `npx vitest run server/src/__tests__/39-sentence-streaming.test.ts`
3. Existing synthesize endpoint still works: grep confirms original POST /api/synthesize route unchanged
4. SSE endpoint exists: grep confirms text/event-stream header in voice.ts
5. Multi-lang endpoint exists: grep confirms synthesize/multi-lang in voice.ts
- VPIPE-07: First sentence plays while subsequent sentences still synthesizing (sentence-buffered SSE streaming)
- VPIPE-08: Single text can be synthesized in multiple languages via /api/synthesize/multi-lang
- Backward compatible: existing /api/synthesize POST unchanged
- All tests green