From 0ea8d3551593a7928b0f5c87625831e2ef9142e8 Mon Sep 17 00:00:00 2001
From: Nexus Dev
Date: Sat, 4 Apr 2026 03:34:58 +0000
Subject: [PATCH] docs(39-01): complete sentence-buffered TTS streaming + multi-language synthesis plan

---
 .planning/REQUIREMENTS.md                     |   8 +-
 .planning/ROADMAP.md                          |   4 +-
 .../phases/39-voice-polish/39-01-SUMMARY.md   | 105 ++++++++++++++++++
 3 files changed, 111 insertions(+), 6 deletions(-)
 create mode 100644 .planning/phases/39-voice-polish/39-01-SUMMARY.md

diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md
index afc6a78a..f992529d 100644
--- a/.planning/REQUIREMENTS.md
+++ b/.planning/REQUIREMENTS.md
@@ -13,8 +13,8 @@
 - [x] **VPIPE-04**: Audio from any source is transcoded to WAV 16kHz mono via ffmpeg before Whisper processing
 - [x] **VPIPE-05**: Voice mode flag on messages triggers voice-optimized response formatting (no markdown, natural prose)
 - [x] **VPIPE-06**: Every voice interaction produces dual output: spoken prose response + full text with code blocks
-- [ ] **VPIPE-07**: TTS plays first sentence while subsequent sentences are still synthesizing (sentence-buffered streaming)
-- [ ] **VPIPE-08**: User can synthesize a single text response into multiple language audio outputs (multi-language TTS)
+- [x] **VPIPE-07**: TTS plays first sentence while subsequent sentences are still synthesizing (sentence-buffered streaming)
+- [x] **VPIPE-08**: User can synthesize a single text response into multiple language audio outputs (multi-language TTS)
 
 ### Web Chat Voice
 
@@ -78,8 +78,8 @@
 | VPIPE-04 | Phase 36 | Complete |
 | VPIPE-05 | Phase 36 | Complete |
 | VPIPE-06 | Phase 36 | Complete |
-| VPIPE-07 | Phase 39 | Pending |
-| VPIPE-08 | Phase 39 | Pending |
+| VPIPE-07 | Phase 39 | Complete |
+| VPIPE-08 | Phase 39 | Complete |
 | WCHAT-01 | Phase 37 | Complete |
 | WCHAT-02 | Phase 37 | Complete |
 | WCHAT-03 | Phase 37 | Complete |
diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md
index e66a1132..3f92679f 100644
--- a/.planning/ROADMAP.md
+++ b/.planning/ROADMAP.md
@@ -168,7 +168,7 @@ Plans:
 **Plans**: 2 plans
 
 Plans:
-- [ ] 39-01-PLAN.md — Sentence-buffered TTS streaming + multi-language synthesis
+- [x] 39-01-PLAN.md — Sentence-buffered TTS streaming + multi-language synthesis
 - [ ] 39-02-PLAN.md — Onboarding voice hardware capability probe
 
 ---
@@ -228,4 +228,4 @@ All 23 v1.6 requirements are mapped to exactly one phase. No orphans.
 | 36. Voice Pipeline Foundation | v1.6 | 2/3 | Complete | 2026-04-04 |
 | 37. Web Chat Voice UI | v1.6 | 3/4 | Complete | 2026-04-04 |
 | 38. Telegram Bridge | v1.6 | 3/3 | Complete | 2026-04-04 |
-| 39. Voice Polish | v1.6 | 0/2 | Not started | - |
+| 39. Voice Polish | v1.6 | 1/2 | In Progress| |
diff --git a/.planning/phases/39-voice-polish/39-01-SUMMARY.md b/.planning/phases/39-voice-polish/39-01-SUMMARY.md
new file mode 100644
index 00000000..b64ebd63
--- /dev/null
+++ b/.planning/phases/39-voice-polish/39-01-SUMMARY.md
@@ -0,0 +1,105 @@
+---
+phase: 39-voice-polish
+plan: "01"
+subsystem: voice
+tags: [tts, streaming, sse, multi-lang, sentence-buffering]
+dependency_graph:
+  requires: [36-voice-pipeline, 37-ui-voice]
+  provides: [sentence-streaming-sse, multi-lang-synthesis, streaming-playback]
+  affects: [ChatVoicePlayer, voice-pipeline, voice-routes]
+tech_stack:
+  added: []
+  patterns: [AsyncGenerator, SSE/EventSource, ReadableStream, base64-audio]
+key_files:
+  created:
+    - server/src/__tests__/39-sentence-streaming.test.ts
+  modified:
+    - server/src/services/voice-pipeline.ts
+    - server/src/routes/voice.ts
+    - ui/src/components/ChatVoicePlayer.tsx
+key_decisions:
+  - "splitSentences protects only title abbreviations (Dr., Mr., etc.) - acronyms like D.C. can still end sentences"
+  - "synthesizeSentenceStream uses AsyncGenerator for lazy per-sentence audio production"
+  - "ChatVoicePlayer uses fetch POST + ReadableStream to parse SSE manually (EventSource only supports GET)"
+  - "Object URL cleanup on unmount and on new text arrival prevents blob memory leaks"
+metrics:
+  duration: "~5 minutes"
+  completed_date: "2026-04-04"
+  tasks_completed: 2
+  files_modified: 4
+---
+
+# Phase 39 Plan 01: Sentence-Buffered TTS Streaming + Multi-Language Synthesis Summary
+
+Sentence-buffered SSE TTS streaming with `synthesizeSentenceStream` AsyncGenerator + `synthesizeMultiLang` Promise.all parallel synthesis + ChatVoicePlayer progressive playback with sentence queue and progress indicator.
+
+## Tasks Completed
+
+| Task | Name | Commit | Files |
+|------|------|--------|-------|
+| 1 (TDD RED) | Failing tests for sentence streaming | 2efe6f30 | server/src/__tests__/39-sentence-streaming.test.ts |
+| 1 (TDD GREEN) | Sentence streaming + multi-lang in pipeline + routes | 5c888c1a | server/src/services/voice-pipeline.ts, server/src/routes/voice.ts |
+| 2 | ChatVoicePlayer sentence-buffered streaming playback | c4b05399 | ui/src/components/ChatVoicePlayer.tsx |
+
+## What Was Built
+
+### voice-pipeline.ts additions
+
+- `splitSentences(text)` — exported function, protects title abbreviations (Dr., Mr., Mrs., etc.) from false sentence splits using placeholder technique; acronyms like D.C. at sentence end still trigger splits
+- `synthesizeSentenceStream(text, voiceId?)` — `AsyncGenerator<{ index, total, audio }>` that calls piper per sentence and yields audio buffers immediately as each completes
+- `synthesizeMultiLang(text, voiceIds[])` — runs `synthesize()` for each voiceId in parallel via `Promise.all`, returns `Map`
+- Internal `synthesizeSentence()` helper extracted to avoid code duplication between the three synthesis methods
+- Existing `synthesize()` updated to use `splitSentences()` internally (backward compatible)
+
+### voice.ts additions
+
+- `POST /api/synthesize/stream` — SSE endpoint with `Content-Type: text/event-stream`, iterates `synthesizeSentenceStream`, writes `data: { index, total, audio: base64 }\n\n` per sentence, finishes with `data: { done: true }\n\n`
+- `POST /api/synthesize/multi-lang` — validates voiceIds array (1-5 entries), calls `synthesizeMultiLang`, returns `{ results: [{ voiceId, audio: base64 }] }`
+- Existing `POST /api/synthesize` unchanged
+
+### ChatVoicePlayer.tsx
+
+- Added `streaming` prop (default `true`)
+- When streaming=true: uses `fetch POST + response.body.getReader()` to parse SSE lines as they arrive (not EventSource — POST required)
+- First chunk received → decode base64, create Blob URL, set audio src, call `.play()` immediately
+- Subsequent chunks → pushed to `audioQueue` ref
+- `onEnded` handler pops next URL from queue and plays it
+- Progress indicator: "Sentence N of M" text + dot progress bar (filled/unfilled per sentence)
+- Fallback: stream error or `streaming=false` → falls back to existing full-fetch via `POST /api/synthesize`
+- All Blob URLs cleaned up on unmount or new `text` prop
+
+## Test Results
+
+```
+Tests 8 passed (8)
+- splitSentences: splits basic sentences ✓
+- splitSentences: abbreviation-aware (Dr., D.C.) ✓
+- splitSentences: single sentence ✓
+- splitSentences: filters empty strings ✓
+- synthesizeSentenceStream: yields correct chunk count + metadata ✓
+- synthesizeSentenceStream: single sentence ✓
+- synthesizeMultiLang: returns Map with all voices ✓
+- synthesizeMultiLang: calls piper in parallel ✓
+```
+
+## Deviations from Plan
+
+### Auto-fixed Issues
+
+None.
+
+### Design Notes
+
+**abbreviation handling:** The plan mentioned "Dr. Smith went to D.C. He liked it." → two sentences. The key insight was that title abbreviations (Dr., Mr., etc.) need protection but two-letter acronyms like D.C. at sentence position can still be split. The implementation uses a space-as-placeholder technique: title abbreviations have their trailing space replaced with `\x00`, then split on `(?<=[.!?])\s+`, then restore.
+
+## Known Stubs
+
+None — all endpoints are wired to real piper TTS synthesis.
+
+## Self-Check: PASSED
+
+- server/src/services/voice-pipeline.ts exists with splitSentences + synthesizeSentenceStream + synthesizeMultiLang ✓
+- server/src/routes/voice.ts has synthesize/stream + synthesize/multi-lang + text/event-stream ✓
+- server/src/__tests__/39-sentence-streaming.test.ts exists and all 8 tests pass ✓
+- ui/src/components/ChatVoicePlayer.tsx has synthesize/stream + ReadableStream + audioQueue + sentence progress ✓
+- Commits: 2efe6f30, 5c888c1a, c4b05399 ✓
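
Review note on the Design Notes in 39-01-SUMMARY.md: the space-as-placeholder technique described there could look roughly like the sketch below. This is an illustration only, not the code committed to `voice-pipeline.ts`. The function name `splitSentencesSketch`, the constant names, and the exact title list are assumptions; the summary only names "Dr., Mr., Mrs., etc." and the `\x00` placeholder.

```typescript
// Hypothetical title list; the summary only says "Dr., Mr., Mrs., etc."
const TITLE_ABBREVIATIONS = ["Dr", "Mr", "Mrs", "Ms", "Prof"];

// Stand-in for the whitespace that follows a title abbreviation.
// \s does not match NUL, so the split regex below skips over it.
const PLACEHOLDER = "\x00";

function splitSentencesSketch(text: string): string[] {
  // Step 1: protect "Dr. Smith" by swapping the whitespace after the
  // abbreviation for the placeholder.
  const titles = new RegExp(
    `\\b(${TITLE_ABBREVIATIONS.join("|")})\\.\\s+`,
    "g",
  );
  const protectedText = text.replace(titles, `$1.${PLACEHOLDER}`);

  return protectedText
    // Step 2: split on whitespace that follows ., ! or ?
    .split(/(?<=[.!?])\s+/)
    // Step 3: restore the protected spaces and trim each sentence.
    .map((sentence) => sentence.split(PLACEHOLDER).join(" ").trim())
    // Drop empty strings (for example, from empty input).
    .filter((sentence) => sentence.length > 0);
}

console.log(splitSentencesSketch("Dr. Smith went to D.C. He liked it."));
```

Under these assumptions, "D.C." is not protected, so it splits whenever whitespace follows it, including mid-sentence. That matches the trade-off recorded under `key_decisions`.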