nexus/.planning/phases/39-voice-polish/39-01-SUMMARY.md

phase: 39-voice-polish
plan: 01
subsystem: voice
tags: tts, streaming, sse, multi-lang, sentence-buffering
dependency_graph:
  requires: 36-voice-pipeline, 37-ui-voice
  provides: sentence-streaming-sse, multi-lang-synthesis, streaming-playback
  affects: ChatVoicePlayer, voice-pipeline, voice-routes
tech_stack:
  added / patterns: AsyncGenerator, SSE/EventSource, ReadableStream, base64-audio
key_files:
  created: server/src/__tests__/39-sentence-streaming.test.ts
  modified: server/src/services/voice-pipeline.ts, server/src/routes/voice.ts, ui/src/components/ChatVoicePlayer.tsx
key_decisions:
  • splitSentences protects only title abbreviations (Dr., Mr., etc.) - acronyms like D.C. can still end sentences
  • synthesizeSentenceStream uses AsyncGenerator for lazy per-sentence audio production
  • ChatVoicePlayer uses fetch POST + ReadableStream to parse SSE manually (EventSource only supports GET)
  • Object URL cleanup on unmount and on new text arrival prevents blob memory leaks
metrics:
  duration: ~5 minutes
  completed_date: 2026-04-04
  tasks_completed: 2
  files_modified: 4

Phase 39 Plan 01: Sentence-Buffered TTS Streaming + Multi-Language Synthesis Summary

Sentence-buffered TTS streaming over SSE via the synthesizeSentenceStream AsyncGenerator, parallel multi-voice synthesis via synthesizeMultiLang (Promise.all), and progressive ChatVoicePlayer playback with a sentence queue and progress indicator.

Tasks Completed

| Task | Name | Commit | Files |
|------|------|--------|-------|
| 1 (TDD RED) | Failing tests for sentence streaming | 2efe6f30 | server/src/__tests__/39-sentence-streaming.test.ts |
| 1 (TDD GREEN) | Sentence streaming + multi-lang in pipeline + routes | 5c888c1a | server/src/services/voice-pipeline.ts, server/src/routes/voice.ts |
| 2 | ChatVoicePlayer sentence-buffered streaming playback | c4b05399 | ui/src/components/ChatVoicePlayer.tsx |

What Was Built

voice-pipeline.ts additions

  • splitSentences(text) — exported function, protects title abbreviations (Dr., Mr., Mrs., etc.) from false sentence splits using placeholder technique; acronyms like D.C. at sentence end still trigger splits
  • synthesizeSentenceStream(text, voiceId?) returns an AsyncGenerator<{ index, total, audio }> that calls piper once per sentence and yields each audio buffer as soon as it completes
  • synthesizeMultiLang(text, voiceIds[]) — runs synthesize() for each voiceId in parallel via Promise.all, returns Map<voiceId, Buffer>
  • Internal synthesizeSentence() helper extracted to avoid code duplication between the three synthesis methods
  • Existing synthesize() updated to use splitSentences() internally (backward compatible)
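The two new entry points can be sketched roughly as follows. This is a minimal illustration, not the code in voice-pipeline.ts: `synthesizeOne` is a hypothetical stand-in for the per-sentence piper call, the `Sketch` function names are invented for the example, audio is modelled as `Uint8Array` rather than `Buffer`, and a zero-based `index` is an assumption.

```typescript
// Hypothetical stand-in for the per-sentence piper call; not the real API.
type SynthesizeFn = (text: string, voiceId?: string) => Promise<Uint8Array>;

interface SentenceChunk {
  index: number; // position of the sentence (zero-based here, by assumption)
  total: number; // number of sentences in the whole text
  audio: Uint8Array;
}

// Synthesizes one sentence at a time and yields each result as soon as it is
// ready, so the consumer can start playback before later sentences exist.
async function* synthesizeSentenceStreamSketch(
  sentences: string[],
  synthesizeOne: SynthesizeFn,
  voiceId?: string,
): AsyncGenerator<SentenceChunk> {
  const total = sentences.length;
  for (let index = 0; index < total; index++) {
    const audio = await synthesizeOne(sentences[index], voiceId);
    yield { index, total, audio };
  }
}

// Starts synthesis for every voice before awaiting any of them, then collects
// the results into a Map keyed by voiceId.
async function synthesizeMultiLangSketch(
  text: string,
  voiceIds: string[],
  synthesizeOne: SynthesizeFn,
): Promise<Map<string, Uint8Array>> {
  const entries = await Promise.all(
    voiceIds.map(async (voiceId): Promise<[string, Uint8Array]> => [
      voiceId,
      await synthesizeOne(text, voiceId),
    ]),
  );
  return new Map(entries);
}
```

Because the generator awaits inside the loop, sentences are synthesized sequentially and in order, whereas the multi-voice variant trades ordering guarantees during execution for parallelism and only returns once every voice has finished.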

voice.ts additions

  • POST /api/synthesize/stream — SSE endpoint with Content-Type: text/event-stream, iterates synthesizeSentenceStream, writes data: { index, total, audio: base64 }\n\n per sentence, finishes with data: { done: true }\n\n
  • POST /api/synthesize/multi-lang — validates voiceIds array (1-5 entries), calls synthesizeMultiLang, returns { results: [{ voiceId, audio: base64 }] }
  • Existing POST /api/synthesize unchanged
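The wire format of the streaming endpoint can be sketched as a pure function that turns sentence chunks into SSE frames. The helper names (`toBase64`, `toSseFrame`, `sseFrames`) are hypothetical, and the route wiring (setting Content-Type: text/event-stream and writing each frame to the response) is omitted; only the `data: ...\n\n` framing and the final `done` event come from the description above.

```typescript
interface SentenceChunk {
  index: number;
  total: number;
  audio: Uint8Array;
}

// Encodes raw bytes as base64 using only the standard btoa global.
function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// One SSE event: a single "data:" line terminated by a blank line.
function toSseFrame(payload: unknown): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Emits one frame per sentence, then a final { done: true } frame.
async function* sseFrames(
  chunks: AsyncIterable<SentenceChunk>,
): AsyncGenerator<string> {
  for await (const { index, total, audio } of chunks) {
    yield toSseFrame({ index, total, audio: toBase64(audio) });
  }
  yield toSseFrame({ done: true });
}
```

In a route handler, each yielded string would be written to the response as it is produced, so the client receives the first sentence while later ones are still being synthesized.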

ChatVoicePlayer.tsx

  • Added streaming prop (default true)
  • When streaming=true: uses fetch POST + response.body.getReader() to parse SSE lines as they arrive (not EventSource — POST required)
  • First chunk received → decode base64, create Blob URL, set audio src, call .play() immediately
  • Subsequent chunks → pushed to audioQueue ref
  • onEnded handler pops next URL from queue and plays it
  • Progress indicator: "Sentence N of M" text + dot progress bar (filled/unfilled per sentence)
  • Fallback: stream error or streaming=false → falls back to existing full-fetch via POST /api/synthesize
  • All Blob URLs cleaned up on unmount or new text prop
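Manual SSE parsing over a fetch response might look like the sketch below. It is an illustration under assumptions, not the component's code: `SseParser`, `readSseEvents` and `StreamEvent` are hypothetical names, and only single-line `data:` events (the shape the endpoint is described as emitting) are handled. The parser buffers partial text because a network chunk can end in the middle of an event.

```typescript
interface StreamEvent {
  index?: number;
  total?: number;
  audio?: string; // base64-encoded audio for one sentence
  done?: boolean;
}

// Accumulates decoded text and extracts events delimited by a blank line.
class SseParser {
  private buffer = "";

  // Feeds more text and returns every event that is now complete.
  push(text: string): StreamEvent[] {
    this.buffer += text;
    const events: StreamEvent[] = [];
    let boundary = this.buffer.indexOf("\n\n");
    while (boundary !== -1) {
      const rawEvent = this.buffer.slice(0, boundary);
      this.buffer = this.buffer.slice(boundary + 2);
      for (const line of rawEvent.split("\n")) {
        if (line.startsWith("data:")) {
          events.push(JSON.parse(line.slice(5).trim()) as StreamEvent);
        }
      }
      boundary = this.buffer.indexOf("\n\n");
    }
    return events;
  }
}

// Reads a fetch() response body and yields events as they arrive.
async function* readSseEvents(
  body: ReadableStream<Uint8Array>,
): AsyncGenerator<StreamEvent> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  const parser = new SseParser();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries.
    for (const event of parser.push(decoder.decode(value, { stream: true }))) {
      yield event;
    }
  }
}
```

A consumer would play the first event's audio immediately, queue the rest, and stop reading once an event with `done: true` arrives.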

Test Results

Tests  8 passed (8)
- splitSentences: splits basic sentences ✓
- splitSentences: abbreviation-aware (Dr., D.C.) ✓
- splitSentences: single sentence ✓
- splitSentences: filters empty strings ✓
- synthesizeSentenceStream: yields correct chunk count + metadata ✓
- synthesizeSentenceStream: single sentence ✓
- synthesizeMultiLang: returns Map with all voices ✓
- synthesizeMultiLang: calls piper in parallel ✓

Deviations from Plan

Auto-fixed Issues

None.

Design Notes

Abbreviation handling: The plan's example, "Dr. Smith went to D.C. He liked it.", must split into two sentences. Title abbreviations (Dr., Mr., etc.) therefore need protection, while an acronym such as D.C. at the end of a sentence must still allow a split. The implementation uses a placeholder technique: the space following each title abbreviation is replaced with \x00, the text is split on (?<=[.!?])\s+, and the placeholders are then restored to spaces.
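A minimal sketch of that placeholder technique is shown here. The function name `splitSentencesSketch` and the abbreviation list are assumptions for illustration; the real list in voice-pipeline.ts is not given in this summary.

```typescript
// Assumed abbreviation list; the actual list is not shown in this summary.
const TITLE_ABBREVIATIONS = ["Dr", "Mr", "Mrs", "Ms", "Prof"];

function splitSentencesSketch(text: string): string[] {
  // Matches a title abbreviation, its period, and the single space after it.
  const titles = new RegExp(`\\b(${TITLE_ABBREVIATIONS.join("|")})\\. `, "g");
  return text
    // Hide the space after a title so the split pattern cannot match it.
    .replace(titles, "$1.\x00")
    // Split on whitespace that directly follows ., ! or ?
    .split(/(?<=[.!?])\s+/)
    // Restore the hidden spaces and tidy each sentence.
    .map((sentence) => sentence.replace(/\x00/g, " ").trim())
    // Drop empty strings left over from blank input.
    .filter((sentence) => sentence.length > 0);
}

const parts = splitSentencesSketch("Dr. Smith went to D.C. He liked it.");
// parts is ["Dr. Smith went to D.C.", "He liked it."]
```

The period inside "D.C." is never followed by whitespace, so only the space after the final "C." triggers a split, which is why acronyms need no special handling in this example.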

Known Stubs

None — all endpoints are wired to real piper TTS synthesis.

Self-Check: PASSED

  • server/src/services/voice-pipeline.ts exists with splitSentences + synthesizeSentenceStream + synthesizeMultiLang ✓
  • server/src/routes/voice.ts has synthesize/stream + synthesize/multi-lang + text/event-stream ✓
  • server/src/__tests__/39-sentence-streaming.test.ts exists and all 8 tests pass ✓
  • ui/src/components/ChatVoicePlayer.tsx has synthesize/stream + ReadableStream + audioQueue + sentence progress ✓
  • Commits: 2efe6f30, 5c888c1a, c4b05399 ✓