docs(39-01): complete sentence-buffered TTS streaming + multi-language synthesis plan

Nexus Dev 2026-04-04 03:34:58 +00:00
parent 08e6b72d99
commit 0ea8d35515
3 changed files with 111 additions and 6 deletions


@ -13,8 +13,8 @@
- [x] **VPIPE-04**: Audio from any source is transcoded to WAV 16kHz mono via ffmpeg before Whisper processing
- [x] **VPIPE-05**: Voice mode flag on messages triggers voice-optimized response formatting (no markdown, natural prose)
- [x] **VPIPE-06**: Every voice interaction produces dual output: spoken prose response + full text with code blocks
- [ ] **VPIPE-07**: TTS plays first sentence while subsequent sentences are still synthesizing (sentence-buffered streaming)
- [ ] **VPIPE-08**: User can synthesize a single text response into multiple language audio outputs (multi-language TTS)
- [x] **VPIPE-07**: TTS plays first sentence while subsequent sentences are still synthesizing (sentence-buffered streaming)
- [x] **VPIPE-08**: User can synthesize a single text response into multiple language audio outputs (multi-language TTS)
### Web Chat Voice
@ -78,8 +78,8 @@
| VPIPE-04 | Phase 36 | Complete |
| VPIPE-05 | Phase 36 | Complete |
| VPIPE-06 | Phase 36 | Complete |
| VPIPE-07 | Phase 39 | Pending |
| VPIPE-08 | Phase 39 | Pending |
| VPIPE-07 | Phase 39 | Complete |
| VPIPE-08 | Phase 39 | Complete |
| WCHAT-01 | Phase 37 | Complete |
| WCHAT-02 | Phase 37 | Complete |
| WCHAT-03 | Phase 37 | Complete |


@ -168,7 +168,7 @@ Plans:
**Plans**: 2 plans
Plans:
- [ ] 39-01-PLAN.md — Sentence-buffered TTS streaming + multi-language synthesis
- [x] 39-01-PLAN.md — Sentence-buffered TTS streaming + multi-language synthesis
- [ ] 39-02-PLAN.md — Onboarding voice hardware capability probe
---
@ -228,4 +228,4 @@ All 23 v1.6 requirements are mapped to exactly one phase. No orphans.
| 36. Voice Pipeline Foundation | v1.6 | 2/3 | Complete | 2026-04-04 |
| 37. Web Chat Voice UI | v1.6 | 3/4 | Complete | 2026-04-04 |
| 38. Telegram Bridge | v1.6 | 3/3 | Complete | 2026-04-04 |
| 39. Voice Polish | v1.6 | 0/2 | Not started | - |
| 39. Voice Polish | v1.6 | 1/2 | In Progress| |


@ -0,0 +1,105 @@
---
phase: 39-voice-polish
plan: "01"
subsystem: voice
tags: [tts, streaming, sse, multi-lang, sentence-buffering]
dependency_graph:
requires: [36-voice-pipeline, 37-ui-voice]
provides: [sentence-streaming-sse, multi-lang-synthesis, streaming-playback]
affects: [ChatVoicePlayer, voice-pipeline, voice-routes]
tech_stack:
added: []
patterns: [AsyncGenerator, SSE/EventSource, ReadableStream, base64-audio]
key_files:
created:
- server/src/__tests__/39-sentence-streaming.test.ts
modified:
- server/src/services/voice-pipeline.ts
- server/src/routes/voice.ts
- ui/src/components/ChatVoicePlayer.tsx
key_decisions:
- "splitSentences protects only title abbreviations (Dr., Mr., etc.) - acronyms like D.C. can still end sentences"
- "synthesizeSentenceStream uses AsyncGenerator for lazy per-sentence audio production"
- "ChatVoicePlayer uses fetch POST + ReadableStream to parse SSE manually (EventSource only supports GET)"
- "Object URL cleanup on unmount and on new text arrival prevents blob memory leaks"
metrics:
duration: "~5 minutes"
completed_date: "2026-04-04"
tasks_completed: 2
files_modified: 4
---
# Phase 39 Plan 01: Sentence-Buffered TTS Streaming + Multi-Language Synthesis Summary
Sentence-buffered SSE TTS streaming via the `synthesizeSentenceStream` AsyncGenerator, parallel multi-language synthesis via `synthesizeMultiLang` (`Promise.all`), and progressive playback in ChatVoicePlayer with a sentence queue and progress indicator.
## Tasks Completed
| Task | Name | Commit | Files |
|------|------|--------|-------|
| 1 (TDD RED) | Failing tests for sentence streaming | 2efe6f30 | server/src/__tests__/39-sentence-streaming.test.ts |
| 1 (TDD GREEN) | Sentence streaming + multi-lang in pipeline + routes | 5c888c1a | server/src/services/voice-pipeline.ts, server/src/routes/voice.ts |
| 2 | ChatVoicePlayer sentence-buffered streaming playback | c4b05399 | ui/src/components/ChatVoicePlayer.tsx |
## What Was Built
### voice-pipeline.ts additions
- `splitSentences(text)` — exported function, protects title abbreviations (Dr., Mr., Mrs., etc.) from false sentence splits using placeholder technique; acronyms like D.C. at sentence end still trigger splits
- `synthesizeSentenceStream(text, voiceId?)` → `AsyncGenerator<{ index, total, audio }>` that calls piper per sentence and yields audio buffers immediately as each completes
- `synthesizeMultiLang(text, voiceIds[])` — runs `synthesize()` for each voiceId in parallel via `Promise.all`, returns `Map<voiceId, Buffer>`
- Internal `synthesizeSentence()` helper extracted to avoid code duplication between the three synthesis methods
- Existing `synthesize()` updated to use `splitSentences()` internally (backward compatible)
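
The two new synthesis entry points above can be sketched roughly as follows. This is a minimal illustration, not the code in `voice-pipeline.ts`: the names `sentenceStream`, `multiLang`, and the injected `synthesizeOne` callback are hypothetical stand-ins for the real piper-backed helper, and `Uint8Array` is used in place of `Buffer` to keep the sketch runtime-neutral.

```typescript
// Shape of one streamed chunk, mirroring the { index, total, audio } described above.
type SentenceChunk = { index: number; total: number; audio: Uint8Array };

// Hypothetical per-sentence synthesizer; in the real pipeline this would wrap piper.
type SynthesizeFn = (text: string, voiceId?: string) => Promise<Uint8Array>;

// Lazily synthesizes one sentence per iteration step.
export async function* sentenceStream(
  sentences: string[],
  synthesizeOne: SynthesizeFn,
  voiceId?: string,
): AsyncGenerator<SentenceChunk> {
  const total = sentences.length;
  for (let index = 0; index < total; index++) {
    // Work for sentence N+1 starts only after the consumer requests the next chunk.
    const audio = await synthesizeOne(sentences[index], voiceId);
    yield { index, total, audio };
  }
}

// Synthesizes the same text once per voice, with all calls in flight together.
export async function multiLang(
  text: string,
  voiceIds: string[],
  synthesizeOne: SynthesizeFn,
): Promise<Map<string, Uint8Array>> {
  const entries = await Promise.all(
    voiceIds.map(async (voiceId): Promise<[string, Uint8Array]> => {
      const audio = await synthesizeOne(text, voiceId);
      return [voiceId, audio];
    }),
  );
  return new Map(entries);
}
```

Because the generator awaits inside the loop, a consumer that stops iterating early never pays for the remaining sentences, whereas `Promise.all` deliberately starts every voice up front.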
### voice.ts additions
- `POST /api/synthesize/stream` — SSE endpoint with `Content-Type: text/event-stream`, iterates `synthesizeSentenceStream`, writes `data: { index, total, audio: base64 }\n\n` per sentence, finishes with `data: { done: true }\n\n`
- `POST /api/synthesize/multi-lang` — validates voiceIds array (1-5 entries), calls `synthesizeMultiLang`, returns `{ results: [{ voiceId, audio: base64 }] }`
- Existing `POST /api/synthesize` unchanged
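
As a rough illustration of the wire format and validation described above, the sketch below frames chunks as SSE `data:` events and checks the 1-5 entry limit. The names `formatSseEvent`, `toSseFrames`, `parseVoiceIds`, and the injected `encode` callback are hypothetical; the requirement that each voice ID be a non-empty string is an assumption, since the summary only states the array length limit.

```typescript
type StreamEvent =
  | { index: number; total: number; audio: string }
  | { done: true };

type AudioChunk = { index: number; total: number; audio: Uint8Array };

// One SSE event: a single "data:" line followed by a blank line.
export function formatSseEvent(payload: StreamEvent): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Turns audio chunks into SSE frames and appends the final { done: true } event.
// `encode` is injected; on Node it could be (a) => Buffer.from(a).toString("base64").
export async function* toSseFrames(
  chunks: AsyncIterable<AudioChunk>,
  encode: (audio: Uint8Array) => string,
): AsyncGenerator<string> {
  for await (const chunk of chunks) {
    yield formatSseEvent({
      index: chunk.index,
      total: chunk.total,
      audio: encode(chunk.audio),
    });
  }
  yield formatSseEvent({ done: true });
}

// Returns the voice IDs when valid, or null when the request should be rejected.
export function parseVoiceIds(input: unknown): string[] | null {
  if (!Array.isArray(input)) return null;
  if (input.length < 1 || input.length > 5) return null;
  // Assumption: entries must be non-empty strings.
  if (!input.every((v) => typeof v === "string" && v.length > 0)) return null;
  return input as string[];
}
```

A route handler would write each yielded frame to the response as soon as it is produced, which is what lets the client start playback before later sentences exist.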
### ChatVoicePlayer.tsx
- Added `streaming` prop (default `true`)
- When streaming=true: uses `fetch POST + response.body.getReader()` to parse SSE lines as they arrive (not EventSource — POST required)
- First chunk received → decode base64, create Blob URL, set audio src, call `.play()` immediately
- Subsequent chunks → pushed to `audioQueue` ref
- `onEnded` handler pops next URL from queue and plays it
- Progress indicator: "Sentence N of M" text + dot progress bar (filled/unfilled per sentence)
- Fallback: on a stream error or when `streaming=false`, uses the existing full-fetch path via `POST /api/synthesize`
- All Blob URLs cleaned up on unmount or new `text` prop
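
The client-side flow above has two parts that are easy to get wrong: SSE events can be split across network chunks, and only the first sentence should start playback directly. The sketch below shows one way to handle both. `SseBuffer` and `PlaybackQueue` are hypothetical names, not the component's actual internals, and the `play` callback stands in for setting the audio element's `src` and calling `.play()`.

```typescript
type StreamEvent =
  | { index: number; total: number; audio: string }
  | { done: true };

// Accumulates decoded text and emits events only once a full "\n\n" boundary arrives.
export class SseBuffer {
  private pending = "";

  push(text: string): StreamEvent[] {
    this.pending += text;
    const events: StreamEvent[] = [];
    let boundary = this.pending.indexOf("\n\n");
    while (boundary !== -1) {
      const rawEvent = this.pending.slice(0, boundary);
      this.pending = this.pending.slice(boundary + 2);
      for (const line of rawEvent.split("\n")) {
        if (line.startsWith("data: ")) {
          events.push(JSON.parse(line.slice(6)) as StreamEvent);
        }
      }
      boundary = this.pending.indexOf("\n\n");
    }
    return events;
  }
}

// Plays the first URL immediately and queues the rest until the current one ends.
export class PlaybackQueue {
  private queue: string[] = [];
  private playing = false;
  private play: (url: string) => void;

  constructor(play: (url: string) => void) {
    this.play = play;
  }

  enqueue(url: string): void {
    if (this.playing) {
      this.queue.push(url);
      return;
    }
    this.playing = true;
    this.play(url);
  }

  // Intended to be called from the audio element's ended handler.
  onEnded(): void {
    const next = this.queue.shift();
    if (next === undefined) {
      this.playing = false;
      return;
    }
    this.play(next);
  }
}
```

In a component, each chunk from `response.body.getReader()` would be decoded with a `TextDecoder` in streaming mode and passed to `push`, and every returned audio event would become a Blob URL handed to `enqueue`.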
## Test Results
```
Tests 8 passed (8)
- splitSentences: splits basic sentences ✓
- splitSentences: abbreviation-aware (Dr., D.C.) ✓
- splitSentences: single sentence ✓
- splitSentences: filters empty strings ✓
- synthesizeSentenceStream: yields correct chunk count + metadata ✓
- synthesizeSentenceStream: single sentence ✓
- synthesizeMultiLang: returns Map with all voices ✓
- synthesizeMultiLang: calls piper in parallel ✓
```
## Deviations from Plan
### Auto-fixed Issues
None.
### Design Notes
**Abbreviation handling:** The plan's example was "Dr. Smith went to D.C. He liked it." → two sentences. The key insight is that title abbreviations (Dr., Mr., etc.) need protection from splitting, while an acronym like D.C. at the end of a sentence should still trigger a split. The implementation uses a placeholder technique: the space following each title abbreviation is replaced with `\x00`, the text is split on `(?<=[.!?])\s+`, and the placeholders are then restored to spaces.
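
A minimal sketch of this placeholder technique, assuming a short hard-coded title list (the actual list in `voice-pipeline.ts` is not shown in this summary, so `TITLE_ABBREVIATIONS` here is hypothetical):

```typescript
// Assumed title list for illustration only.
const TITLE_ABBREVIATIONS = ["Dr", "Mr", "Mrs", "Ms", "Prof"];
const PLACEHOLDER = "\x00";

export function splitSentences(text: string): string[] {
  // Step 1: hide the whitespace after a title so the split regex cannot match there.
  const titlePattern = new RegExp(
    `\\b(${TITLE_ABBREVIATIONS.join("|")})\\.\\s+`,
    "g",
  );
  const protectedText = text.replace(titlePattern, `$1.${PLACEHOLDER}`);

  return protectedText
    // Step 2: split on whitespace that follows sentence-ending punctuation.
    .split(/(?<=[.!?])\s+/)
    // Step 3: restore the hidden spaces and drop empty fragments.
    .map((sentence) => sentence.split(PLACEHOLDER).join(" ").trim())
    .filter((sentence) => sentence.length > 0);
}
```

For the plan's example, "Dr." is protected while the space after "D.C." still matches the split pattern, so the input yields two sentences. The trade-off is that a mid-sentence acronym followed by a space, such as "D.C. residents", would also be split.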
## Known Stubs
None — all endpoints are wired to real piper TTS synthesis.
## Self-Check: PASSED
- server/src/services/voice-pipeline.ts exists with splitSentences + synthesizeSentenceStream + synthesizeMultiLang ✓
- server/src/routes/voice.ts has synthesize/stream + synthesize/multi-lang + text/event-stream ✓
- server/src/__tests__/39-sentence-streaming.test.ts exists and all 8 tests pass ✓
- ui/src/components/ChatVoicePlayer.tsx has synthesize/stream + ReadableStream + audioQueue + sentence progress ✓
- Commits: 2efe6f30, 5c888c1a, c4b05399 ✓