---
phase: 36-voice-pipeline-foundation
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
  - server/src/services/voice-pipeline.ts
  - server/src/__tests__/36-voice-pipeline.test.ts
  - server/package.json
autonomous: true
requirements:
  - VPIPE-01
  - VPIPE-02
  - VPIPE-04
  - VPIPE-06

must_haves:
  truths:
    - "transcribe() accepts a Buffer and format string, returns { text, language? }"
    - "synthesize() accepts text string and optional voiceId, returns a WAV Buffer"
    - "transcodeToWav16k() converts any input format to WAV 16kHz mono via ffmpeg-static"
    - "formatForVoice() strips markdown and extracts SPOKEN section when present"
    - "formatForVoice() falls back to markdown stripping when SPOKEN marker is absent"
  artifacts:
    - path: "server/src/services/voice-pipeline.ts"
      provides: "VoicePipelineService with transcribe, synthesize, formatForVoice, transcodeToWav16k"
      exports: ["voicePipelineService"]
    - path: "server/src/__tests__/36-voice-pipeline.test.ts"
      provides: "Unit tests for voice pipeline service"
      min_lines: 80
  key_links:
    - from: "server/src/services/voice-pipeline.ts"
      to: "ffmpeg-static"
      via: "import ffmpegPath from ffmpeg-static"
      pattern: "import.*ffmpeg-static"
    - from: "server/src/services/voice-pipeline.ts"
      to: "node:child_process"
      via: "execFile and spawn for piper/ffmpeg"
      pattern: "execFile|spawn"
---

<objective>
Create VoicePipelineService, the transport-agnostic voice service that all downstream consumers (voice routes, Telegram bridge) depend on.

Purpose: This is the keystone service for the entire v1.6 milestone. It encapsulates STT (Whisper), TTS (Piper), audio transcoding (ffmpeg), and dual-output formatting behind a clean factory function API.

Output: `server/src/services/voice-pipeline.ts` with `transcribe()`, `synthesize()`, `formatForVoice()` methods, plus unit tests.
</objective>

<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/36-voice-pipeline-foundation/36-RESEARCH.md

<interfaces>
<!-- Existing patterns to follow -->

From server/src/services/nexus-settings.ts (factory function pattern):
```typescript
export function nexusSettingsService() {
  async function get(): Promise<NexusSettings> { ... }
  async function set(patch: Partial<NexusSettings>): Promise<NexusSettings> { ... }
  return { get, set };
}
```

From server/src/routes/chat-files.ts lines 297-386 (existing whisper cascade to extract and move):
```typescript
// Writes raw WebM to temp file, then tries whisper-cpp --model base.en --no-timestamps
// Falls back to openai-whisper Python CLI
// Uses promisify(execFileCb) pattern from node:child_process
```
</interfaces>
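
The cascade summarized in the comments above (try whisper-cpp, fall back to another binary, fail if neither works) could look roughly like the sketch below once it moves into the service. This is not the code from `chat-files.ts`; `firstSuccessful`, `Attempt`, and `whisperCppAttempt` are hypothetical names introduced for illustration, and the whisper-cpp arguments are the ones quoted in this plan.

```typescript
import { execFile as execFileCb } from "node:child_process";
import { promisify } from "node:util";

const execFile = promisify(execFileCb);

// Hypothetical type: one transcription attempt that resolves with text or rejects.
type Attempt = () => Promise<string>;

// Run the attempts in order and return the first transcript that resolves.
export async function firstSuccessful(attempts: Attempt[], unavailable: string): Promise<string> {
  for (const attempt of attempts) {
    try {
      return (await attempt()).trim();
    } catch {
      // Missing binary, non-zero exit, or timeout: fall through to the next attempt.
    }
  }
  throw new Error(unavailable);
}

// Assumed wiring for the first attempt, using the flags quoted in this plan.
export function whisperCppAttempt(wavPath: string): Attempt {
  return async () => {
    const { stdout } = await execFile(
      "whisper-cpp",
      ["--model", "base.en", "--file", wavPath, "--no-timestamps"],
      { timeout: 30_000 },
    );
    return stdout;
  };
}
```

Keeping each binary behind an `Attempt` lets the unit tests substitute plain functions for the real CLIs.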
</context>

<tasks>

<task type="auto" tdd="true">
<name>Task 1: Install ffmpeg-static and create VoicePipelineService with tests</name>
<files>
server/package.json
server/src/services/voice-pipeline.ts
server/src/__tests__/36-voice-pipeline.test.ts
</files>
<read_first>
server/src/services/nexus-settings.ts
server/src/routes/chat-files.ts
server/package.json
</read_first>
<behavior>
- transcodeToWav16k: spawns ffmpeg with args ["-f", inputFormat, "-i", "pipe:0", "-ar", "16000", "-ac", "1", "-f", "wav", "pipe:1"] and returns Buffer
- transcodeToWav16k: rejects with error when ffmpeg exits non-zero
- transcribe: calls transcodeToWav16k first (unless format is "wav"), then writes to temp file, runs whisper-cpp with --language auto flag, returns { text, language }
- transcribe: falls back to openai-whisper Python CLI when whisper-cpp fails
- transcribe: returns 503-style error when neither whisper binary is available
- synthesize: splits text into sentences using `/(?<=[.!?])\s+/` regex, calls piper via execFile for each chunk, concatenates WAV buffers
- synthesize: wraps each piper call in Promise.race with 8000ms timeout
- synthesize: returns error when piper binary is not found
- formatForVoice: extracts text between "SPOKEN:" and "DETAILED:" markers when both present
- formatForVoice: strips markdown (headings `##`, bold `**`, italic `*`, code fences `` ``` ``, bullet points `-`/`*`) when SPOKEN marker is absent
- formatForVoice: handles empty string input returning empty string
- voicePipelineService factory: throws Error("ffmpeg-static binary not found") when ffmpegPath is null
</behavior>
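
The two synthesize behaviors above (sentence chunking and the 8000ms timeout) can be sketched as a pair of small helpers. This is a minimal illustration, not the final implementation: `splitSentences` is a hypothetical name, and clearing the timer in `withTimeout` is an addition that the plan does not require.

```typescript
// Split on whitespace that follows sentence-ending punctuation (regex from the behavior list).
export function splitSentences(text: string): string[] {
  return text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
}

// Reject if `promise` has not settled within `ms` milliseconds.
// The timer is cleared once the race settles so it cannot keep the process alive.
export function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note that `Promise.race` only stops waiting; it does not stop the child process. Stopping a slow piper call relies on the `timeout` option that is passed to `execFile` alongside the race.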
<action>
1. Install ffmpeg-static:
   ```bash
   cd /opt/nexus/server && pnpm add ffmpeg-static && pnpm add -D @types/ffmpeg-static
   ```

2. Create `server/src/__tests__/36-voice-pipeline.test.ts` with unit tests (RED phase):
   - Mock `node:child_process` execFile and spawn
   - Mock `ffmpeg-static` to return "/mock/ffmpeg" by default, null for the fail-fast test
   - Test `transcodeToWav16k(buffer, "webm")` verifies spawn is called with args ["-f", "webm", "-i", "pipe:0", "-ar", "16000", "-ac", "1", "-f", "wav", "pipe:1"]
   - Test `transcribe(buffer, "webm")` verifies it calls transcodeToWav16k then whisper-cpp with `--language auto`
   - Test `transcribe` whisper-cpp fallback to openai-whisper
   - Test `synthesize("Hello world. How are you?")` verifies it splits into 2 sentences and calls piper execFile twice
   - Test `synthesize` timeout rejects after 8000ms
   - Test `formatForVoice("SPOKEN: Hello\n\nDETAILED: ## Hello\n**world**")` returns "Hello"
   - Test ``formatForVoice("## Hello\n**world**\n- item\n```code```")`` returns "Hello\nworld\nitem\ncode" (markdown stripped)
   - Test `formatForVoice("")` returns ""
   - Test factory throws when ffmpegPath is null

3. Create `server/src/services/voice-pipeline.ts` (GREEN phase):
   - Export `voicePipelineService()` factory function (matches nexus-settings pattern)
   - Assert `ffmpegPath` is not null at construction time: `if (!ffmpegPath) throw new Error("ffmpeg-static binary not found on this platform");`
   - `transcodeToWav16k(inputBuffer: Buffer, inputFormat: string): Promise<Buffer>`: uses `spawn(ffmpegPath, ["-f", inputFormat, "-i", "pipe:0", "-ar", "16000", "-ac", "1", "-f", "wav", "pipe:1"], { stdio: ["pipe", "pipe", "pipe"] })`; write inputBuffer to stdin, collect stdout chunks, resolve on close code 0, reject otherwise
   - `withTimeout<T>(promise: Promise<T>, ms: number): Promise<T>`: ``Promise.race([promise, new Promise<never>((_, reject) => setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms))])``
   - `transcribe(buffer: Buffer, format: "webm" | "ogg" | "wav"): Promise<{ text: string; language?: string }>`:
     1. If format !== "wav", call transcodeToWav16k(buffer, format)
     2. Write WAV to temp file: `path.join(tmpdir(), "nexus-audio-" + Date.now() + ".wav")`
     3. Try whisper-cpp: `execFile("whisper-cpp", ["--model", "base.en", "--file", tmpPath, "--no-timestamps", "--output-txt", "--language", "auto"], { timeout: 30000 })`
     4. Parse language from whisper-cpp stdout if present; return `{ text: stdout.trim(), language }`
     5. On failure, try openai-whisper: `execFile("whisper", [tmpPath, "--model", "base.en", "--output_format", "txt", "--output_dir", tmpdir()], { timeout: 60000 })`
     6. If both fail, throw new Error("Whisper not available. Install whisper-cpp or openai-whisper for voice input.")
     7. Clean up the temp file in a `finally` block via `unlink(tmpPath).catch(() => {})`
   - `synthesize(text: string, voiceId?: string): Promise<Buffer>`:
     1. Split text into sentences: `text.split(/(?<=[.!?])\s+/).filter(s => s.length > 0)`
     2. For each sentence, call piper via `execFile("piper", ["--model", voiceId || "en_US-lessac-medium", "--output-raw"], { timeout: 8000, maxBuffer: 10 * 1024 * 1024, encoding: "buffer" })`, write the sentence to the child's stdin with `child.stdin.end(sentence)` (the async `execFile` has no `input` option; only the sync variants do), and wrap the call in `withTimeout(..., 8000)`
     3. Concatenate all output buffers (`--output-raw` yields headerless PCM, so prepend a WAV header to the concatenated audio to satisfy the WAV return contract)
     4. On ENOENT (piper not found), throw new Error("Piper TTS not available. Install piper for voice output.")
   - `formatForVoice(text: string): string`:
     1. If empty, return ""
     2. Check for `SPOKEN:` marker: `const spokenMatch = text.match(/SPOKEN:\s*([\s\S]*?)(?=\nDETAILED:|\n\n[A-Z]+:)/)`
     3. If match found, return `spokenMatch[1].trim()`
     4. Otherwise strip markdown: remove `# ` headings, `**` bold, `*` italic, triple backtick code fences (and lang identifier), `- ` and `* ` bullet prefixes, inline backticks
     5. Collapse multiple blank lines, trim
   - Return `{ transcribe, synthesize, formatForVoice }`
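
The `transcodeToWav16k` step above can be sketched as follows. It is a minimal illustration under two assumptions that differ from the plan: the ffmpeg binary path is passed in as a parameter (`ffmpegBinary`) so the sketch does not depend on `ffmpeg-static` being installed, and the argument list lives in a hypothetical `ffmpegArgs` helper so it can be asserted without spawning anything.

```typescript
import { spawn } from "node:child_process";

// Arguments from the plan: read `inputFormat` from stdin, write 16 kHz mono WAV to stdout.
export function ffmpegArgs(inputFormat: string): string[] {
  return ["-f", inputFormat, "-i", "pipe:0", "-ar", "16000", "-ac", "1", "-f", "wav", "pipe:1"];
}

export function transcodeToWav16k(
  ffmpegBinary: string, // assumption: injected here instead of imported from ffmpeg-static
  input: Buffer,
  inputFormat: string,
): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    // stdio defaults to pipes, which matches the explicit ["pipe", "pipe", "pipe"] in the plan.
    const child = spawn(ffmpegBinary, ffmpegArgs(inputFormat));
    const out: Buffer[] = [];
    const err: Buffer[] = [];
    child.stdout.on("data", (chunk: Buffer) => out.push(chunk));
    child.stderr.on("data", (chunk: Buffer) => err.push(chunk));
    child.on("error", reject); // e.g. ENOENT when the binary is missing
    child.on("close", (code) => {
      if (code === 0) {
        resolve(Buffer.concat(out));
      } else {
        const tail = Buffer.concat(err).toString("utf8").slice(-500);
        reject(new Error(`ffmpeg exited with code ${code}: ${tail}`));
      }
    });
    // Swallow EPIPE if ffmpeg exits before reading all input; the close handler reports the failure.
    child.stdin.on("error", () => {});
    child.stdin.end(input);
  });
}
```

Collecting stderr is optional, but including its tail in the rejection tends to make a failed transcode much easier to diagnose.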
</action>
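
The `formatForVoice` steps can be sketched as a single pure function. The `SPOKEN:` regex is the one given in the action list; the individual markdown-stripping patterns are assumptions chosen to satisfy the test cases in this plan, not a complete markdown parser.

```typescript
export function formatForVoice(text: string): string {
  if (!text) return "";

  // Prefer the explicit SPOKEN section when a following section marker is present.
  const spokenMatch = text.match(/SPOKEN:\s*([\s\S]*?)(?=\nDETAILED:|\n\n[A-Z]+:)/);
  if (spokenMatch) return spokenMatch[1].trim();

  // Fallback: strip common markdown syntax (assumed patterns, see lead-in).
  return text
    .replace(/^```[^\n`]*$/gm, "") // fence lines, including a language identifier
    .replace(/```/g, "") // any remaining triple backticks
    .replace(/`([^`]*)`/g, "$1") // inline code
    .replace(/^#{1,6}[ \t]+/gm, "") // heading markers
    .replace(/^[ \t]*[-*][ \t]+/gm, "") // bullet prefixes
    .replace(/\*\*([^*]+)\*\*/g, "$1") // bold
    .replace(/\*([^*]+)\*/g, "$1") // italic
    .replace(/\n{3,}/g, "\n\n") // collapse runs of blank lines
    .trim();
}
```

Fence lines are removed before the generic triple-backtick pass so that a language identifier such as `ts` is dropped, while text wrapped in triple backticks on a single line is kept.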
<verify>
<automated>cd /opt/nexus && pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-pipeline.test.ts</automated>
</verify>
<acceptance_criteria>
- server/src/services/voice-pipeline.ts contains `export function voicePipelineService()`
- server/src/services/voice-pipeline.ts contains `import ffmpegPath from "ffmpeg-static"`
- server/src/services/voice-pipeline.ts contains `spawn(ffmpegPath` (not exec)
- server/src/services/voice-pipeline.ts contains `execFile("whisper-cpp"` (not exec)
- server/src/services/voice-pipeline.ts contains `"--language", "auto"` for VPIPE-01 language detection
- server/src/services/voice-pipeline.ts contains `"-ar", "16000", "-ac", "1"` for VPIPE-04 transcoding
- server/src/services/voice-pipeline.ts contains `Promise.race` for timeout wrapping
- server/src/services/voice-pipeline.ts contains `SPOKEN:` marker check in formatForVoice
- server/src/services/voice-pipeline.ts contains `return { transcribe, synthesize, formatForVoice }`
- server/src/__tests__/36-voice-pipeline.test.ts exits 0
- server/package.json contains "ffmpeg-static"
</acceptance_criteria>
<done>
VoicePipelineService exists with transcribe (whisper cascade + language detection), synthesize (piper with sentence chunking + timeout), formatForVoice (SPOKEN extraction + markdown strip fallback), and transcodeToWav16k (ffmpeg pipe). All unit tests pass with mocked child_process.
</done>
</task>

</tasks>

<verification>
- `pnpm --filter @paperclipai/server test --run src/__tests__/36-voice-pipeline.test.ts` exits 0
- `grep -c "export function voicePipelineService" server/src/services/voice-pipeline.ts` returns 1
- `grep "ffmpeg-static" server/package.json` shows dependency present
</verification>

<success_criteria>
VoicePipelineService is a working, tested factory function that downstream consumers (voice.ts routes in Plan 03, Telegram bridge in Phase 38) can import and call without additional setup. ffmpeg-static is installed. All unit tests pass.
</success_criteria>

<output>
After completion, create `.planning/phases/36-voice-pipeline-foundation/36-01-SUMMARY.md`
</output>