docs(39-01): complete sentence-buffered TTS streaming + multi-language synthesis plan

2026-04-04 03:34:58 +00:00 · 2026-04-04 03:34:58 +00:00 · 0ea8d35515
commit 0ea8d35515
parent 08e6b72d99
3 changed files with 111 additions and 6 deletions
--- a/.planning/REQUIREMENTS.md
+++ b/.planning/REQUIREMENTS.md
@@ -13,8 +13,8 @@
 - [x] **VPIPE-04**: Audio from any source is transcoded to WAV 16kHz mono via ffmpeg before Whisper processing
 - [x] **VPIPE-05**: Voice mode flag on messages triggers voice-optimized response formatting (no markdown, natural prose)
 - [x] **VPIPE-06**: Every voice interaction produces dual output: spoken prose response + full text with code blocks
-- [ ] **VPIPE-07**: TTS plays first sentence while subsequent sentences are still synthesizing (sentence-buffered streaming)
-- [ ] **VPIPE-08**: User can synthesize a single text response into multiple language audio outputs (multi-language TTS)
+- [x] **VPIPE-07**: TTS plays first sentence while subsequent sentences are still synthesizing (sentence-buffered streaming)
+- [x] **VPIPE-08**: User can synthesize a single text response into multiple language audio outputs (multi-language TTS)

 ### Web Chat Voice

@@ -78,8 +78,8 @@
 | VPIPE-04 | Phase 36 | Complete |
 | VPIPE-05 | Phase 36 | Complete |
 | VPIPE-06 | Phase 36 | Complete |
-| VPIPE-07 | Phase 39 | Pending |
-| VPIPE-08 | Phase 39 | Pending |
+| VPIPE-07 | Phase 39 | Complete |
+| VPIPE-08 | Phase 39 | Complete |
 | WCHAT-01 | Phase 37 | Complete |
 | WCHAT-02 | Phase 37 | Complete |
 | WCHAT-03 | Phase 37 | Complete |
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -168,7 +168,7 @@ Plans:
 **Plans**: 2 plans

 Plans:
-- [ ] 39-01-PLAN.md — Sentence-buffered TTS streaming + multi-language synthesis
+- [x] 39-01-PLAN.md — Sentence-buffered TTS streaming + multi-language synthesis
 - [ ] 39-02-PLAN.md — Onboarding voice hardware capability probe

 ---
@@ -228,4 +228,4 @@ All 23 v1.6 requirements are mapped to exactly one phase. No orphans.
 | 36. Voice Pipeline Foundation | v1.6 | 2/3 | Complete    | 2026-04-04 |
 | 37. Web Chat Voice UI | v1.6 | 3/4 | Complete    | 2026-04-04 |
 | 38. Telegram Bridge | v1.6 | 3/3 | Complete    | 2026-04-04 |
-| 39. Voice Polish | v1.6 | 0/2 | Not started | - |
+| 39. Voice Polish | v1.6 | 1/2 | In Progress |  |
--- a/.planning/phases/39-voice-polish/39-01-SUMMARY.md
+++ b/.planning/phases/39-voice-polish/39-01-SUMMARY.md
@@ -0,0 +1,105 @@
+---
+phase: 39-voice-polish
+plan: "01"
+subsystem: voice
+tags: [tts, streaming, sse, multi-lang, sentence-buffering]
+dependency_graph:
+  requires: [36-voice-pipeline, 37-ui-voice]
+  provides: [sentence-streaming-sse, multi-lang-synthesis, streaming-playback]
+  affects: [ChatVoicePlayer, voice-pipeline, voice-routes]
+tech_stack:
+  added: []
+  patterns: [AsyncGenerator, SSE/EventSource, ReadableStream, base64-audio]
+key_files:
+  created:
+    - server/src/__tests__/39-sentence-streaming.test.ts
+  modified:
+    - server/src/services/voice-pipeline.ts
+    - server/src/routes/voice.ts
+    - ui/src/components/ChatVoicePlayer.tsx
+key_decisions:
+  - "splitSentences protects only title abbreviations (Dr., Mr., etc.) - acronyms like D.C. can still end sentences"
+  - "synthesizeSentenceStream uses AsyncGenerator for lazy per-sentence audio production"
+  - "ChatVoicePlayer uses fetch POST + ReadableStream to parse SSE manually (EventSource only supports GET)"
+  - "Object URL cleanup on unmount and on new text arrival prevents blob memory leaks"
+metrics:
+  duration: "~5 minutes"
+  completed_date: "2026-04-04"
+  tasks_completed: 2
+  files_modified: 4
+---
+
+# Phase 39 Plan 01: Sentence-Buffered TTS Streaming + Multi-Language Synthesis Summary
+
+Adds sentence-buffered TTS streaming over SSE via the `synthesizeSentenceStream` AsyncGenerator, parallel multi-voice synthesis via `synthesizeMultiLang` (`Promise.all`), and progressive playback in ChatVoicePlayer with a sentence queue and progress indicator.
+
+## Tasks Completed
+
+| Task | Name | Commit | Files |
+|------|------|--------|-------|
+| 1 (TDD RED) | Failing tests for sentence streaming | 2efe6f30 | server/src/__tests__/39-sentence-streaming.test.ts |
+| 1 (TDD GREEN) | Sentence streaming + multi-lang in pipeline + routes | 5c888c1a | server/src/services/voice-pipeline.ts, server/src/routes/voice.ts |
+| 2 | ChatVoicePlayer sentence-buffered streaming playback | c4b05399 | ui/src/components/ChatVoicePlayer.tsx |
+
+## What Was Built
+
+### voice-pipeline.ts additions
+
+- `splitSentences(text)` — exported function, protects title abbreviations (Dr., Mr., Mrs., etc.) from false sentence splits using placeholder technique; acronyms like D.C. at sentence end still trigger splits
+- `synthesizeSentenceStream(text, voiceId?)` — `AsyncGenerator<{ index, total, audio }>` that calls piper per sentence and yields audio buffers immediately as each completes
+- `synthesizeMultiLang(text, voiceIds[])` — runs `synthesize()` for each voiceId in parallel via `Promise.all`, returns `Map<voiceId, Buffer>`
+- Internal `synthesizeSentence()` helper extracted to avoid code duplication between the three synthesis methods
+- Existing `synthesize()` updated to use `splitSentences()` internally (backward compatible)
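
The lazy per-sentence generator and the parallel multi-voice call listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `voice-pipeline.ts`: `SynthesizeFn` and `SentenceChunk` are hypothetical names, the synthesizer is injected as a parameter in place of the real piper call, the generator takes pre-split sentences rather than raw text, and the 0-based `index` is an assumption.

```typescript
// Hypothetical stand-in for the piper-backed helper: resolves to raw audio bytes.
type SynthesizeFn = (text: string, voiceId?: string) => Promise<Uint8Array>;

// Shape of one streamed item; mirrors the { index, total, audio } fields in the summary.
interface SentenceChunk {
  index: number; // assumed 0-based
  total: number;
  audio: Uint8Array;
}

// Synthesizes one sentence at a time and yields each result as soon as it is ready,
// so a consumer can start playback before later sentences are synthesized.
async function* synthesizeSentenceStream(
  sentences: string[],
  synthesizeSentence: SynthesizeFn,
  voiceId?: string,
): AsyncGenerator<SentenceChunk> {
  const total = sentences.length;
  for (let index = 0; index < total; index++) {
    const audio = await synthesizeSentence(sentences[index], voiceId);
    yield { index, total, audio };
  }
}

// Starts synthesis for every voice at once and waits for all of them.
// Promise.all preserves input order, so buffers[i] belongs to voiceIds[i].
async function synthesizeMultiLang(
  text: string,
  voiceIds: string[],
  synthesize: SynthesizeFn,
): Promise<Map<string, Uint8Array>> {
  const buffers = await Promise.all(voiceIds.map((id) => synthesize(text, id)));
  return new Map(voiceIds.map((id, i): [string, Uint8Array] => [id, buffers[i]]));
}
```

Because `Promise.all` rejects as soon as any one synthesis fails, a caller that wants partial results would need `Promise.allSettled` instead; the summary does not say which behavior the real function has.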
+
+### voice.ts additions
+
+- `POST /api/synthesize/stream` — SSE endpoint with `Content-Type: text/event-stream`, iterates `synthesizeSentenceStream`, writes `data: { index, total, audio: base64 }\n\n` per sentence, finishes with `data: { done: true }\n\n`
+- `POST /api/synthesize/multi-lang` — validates voiceIds array (1-5 entries), calls `synthesizeMultiLang`, returns `{ results: [{ voiceId, audio: base64 }] }`
+- Existing `POST /api/synthesize` unchanged
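
The per-sentence event format written by the stream endpoint can be illustrated with a small framing helper. This is a sketch only: it assumes each `data:` payload is JSON, and `sseFrame`, `sentenceFrame`, `bytesToBase64`, and `DONE_FRAME` are hypothetical names that do not come from `voice.ts`.

```typescript
interface SentenceChunk {
  index: number;
  total: number;
  audio: Uint8Array;
}

// Encodes raw bytes as base64 using the global btoa (available in browsers and Node 16+).
function bytesToBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// One SSE event: a single "data:" line terminated by a blank line.
function sseFrame(payload: unknown): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Frame for one synthesized sentence, with the audio carried as base64 text.
function sentenceFrame(chunk: SentenceChunk): string {
  return sseFrame({
    index: chunk.index,
    total: chunk.total,
    audio: bytesToBase64(chunk.audio),
  });
}

// Final frame that tells the client the stream is complete.
const DONE_FRAME = sseFrame({ done: true });
```

A route handler would write one `sentenceFrame(...)` per item yielded by the generator and then `DONE_FRAME` before ending the response.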
+
+### ChatVoicePlayer.tsx
+
+- Added `streaming` prop (default `true`)
+- When streaming=true: uses `fetch POST + response.body.getReader()` to parse SSE lines as they arrive (not EventSource — POST required)
+- First chunk received → decode base64, create Blob URL, set audio src, call `.play()` immediately
+- Subsequent chunks → pushed to `audioQueue` ref
+- `onEnded` handler takes the next URL from the front of the queue and plays it
+- Progress indicator: "Sentence N of M" text + dot progress bar (filled/unfilled per sentence)
+- Fallback: on a stream error or when `streaming=false`, the player uses the existing full-fetch path via `POST /api/synthesize`
+- All Blob URLs cleaned up on unmount or new `text` prop
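
Since the player reads the SSE response manually, it has to cope with events that arrive split across network chunks. A minimal sketch of such an incremental parser is shown here; `SseLineParser` and `StreamEvent` are hypothetical names, and the JSON payload shape is an assumption based on the endpoint description.

```typescript
type StreamEvent =
  | { index: number; total: number; audio: string }
  | { done: true };

// Accumulates decoded text and emits only complete events (terminated by a blank line).
class SseLineParser {
  private buffer = "";

  push(chunk: string): StreamEvent[] {
    this.buffer += chunk;
    const frames = this.buffer.split("\n\n");
    // The last piece is either empty or an incomplete event: keep it for the next call.
    this.buffer = frames.pop() ?? "";

    const events: StreamEvent[] = [];
    for (const frame of frames) {
      for (const line of frame.split("\n")) {
        if (line.startsWith("data:")) {
          events.push(JSON.parse(line.slice(5).trim()) as StreamEvent);
        }
      }
    }
    return events;
  }
}
```

In a component, each value read from `response.body.getReader()` would be decoded with `new TextDecoder().decode(value, { stream: true })` and passed to `push`; the first sentence event would start playback and later ones would be appended to the queue.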
+
+## Test Results
+
+```
+Tests  8 passed (8)
+- splitSentences: splits basic sentences ✓
+- splitSentences: abbreviation-aware (Dr., D.C.) ✓
+- splitSentences: single sentence ✓
+- splitSentences: filters empty strings ✓
+- synthesizeSentenceStream: yields correct chunk count + metadata ✓
+- synthesizeSentenceStream: single sentence ✓
+- synthesizeMultiLang: returns Map with all voices ✓
+- synthesizeMultiLang: calls piper in parallel ✓
+```
+
+## Deviations from Plan
+
+### Auto-fixed Issues
+
+None.
+
+### Design Notes
+
+**Abbreviation handling:** The plan's example, "Dr. Smith went to D.C. He liked it.", must split into two sentences. Title abbreviations (Dr., Mr., etc.) therefore need protection, while an acronym such as D.C. at the end of a sentence must still allow a split. The implementation uses a placeholder technique: the space after each title abbreviation is replaced with `\x00`, the text is split on `(?<=[.!?])\s+`, and the placeholders are then restored to spaces.
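
The placeholder technique described in this note can be sketched as follows. This is an illustration, not the committed `splitSentences`: the abbreviation list beyond Dr., Mr., and Mrs. is an assumption, and `TITLE_ABBREVIATIONS` and `PLACEHOLDER` are hypothetical names.

```typescript
// Titles to protect. The summary names only "Dr., Mr., Mrs., etc.";
// the remaining entries are assumptions for illustration.
const TITLE_ABBREVIATIONS = ["Dr", "Mr", "Mrs", "Ms", "Prof"];

// NUL is not matched by \s, so a protected gap cannot become a split point.
const PLACEHOLDER = "\x00";

function splitSentences(text: string): string[] {
  // 1. Replace the whitespace after each title abbreviation with the placeholder.
  const titlePattern = new RegExp(
    `\\b(${TITLE_ABBREVIATIONS.join("|")})\\.\\s+`,
    "g",
  );
  const protectedText = text.replace(titlePattern, `$1.${PLACEHOLDER}`);

  // 2. Split on whitespace that follows sentence-ending punctuation.
  //    Equivalent to the regex literal /(?<=[.!?])\s+/.
  const boundary = new RegExp("(?<=[.!?])\\s+");

  // 3. Restore the placeholders to spaces and drop empty pieces.
  return protectedText
    .split(boundary)
    .map((s) => s.split(PLACEHOLDER).join(" ").trim())
    .filter((s) => s.length > 0);
}
```

With this sketch, `splitSentences("Dr. Smith went to D.C. He liked it.")` returns `["Dr. Smith went to D.C.", "He liked it."]`. One consequence of splitting after every period is that an acronym in mid-sentence (for example "D.C. is humid") would also be split.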
+
+## Known Stubs
+
+None — all endpoints are wired to real piper TTS synthesis.
+
+## Self-Check: PASSED
+
+- server/src/services/voice-pipeline.ts exists with splitSentences + synthesizeSentenceStream + synthesizeMultiLang ✓
+- server/src/routes/voice.ts has synthesize/stream + synthesize/multi-lang + text/event-stream ✓
+- server/src/__tests__/39-sentence-streaming.test.ts exists and all 8 tests pass ✓
+- ui/src/components/ChatVoicePlayer.tsx has synthesize/stream + ReadableStream + audioQueue + sentence progress ✓
+- Commits: 2efe6f30, 5c888c1a, c4b05399 ✓