---
phase: 38-telegram-bridge
plan: 02
type: execute
wave: 2
depends_on: ["38-01"]
files_modified:
- server/src/services/telegram.ts
autonomous: true
requirements: [TGRAM-03, TGRAM-04]
must_haves:
  truths:
    - "A voice note sent to the Telegram bot is transcribed and produces an agent text reply"
    - "The bot can send back an OGG voice note generated from TTS"
  artifacts:
    - path: "server/src/services/telegram.ts"
      provides: "Voice message handler (OGG download, transcribe, relay) and TTS reply (synthesize, PCM->OGG, sendVoice)"
      contains: "message:voice"
  key_links:
    - from: "server/src/services/telegram.ts"
      to: "server/src/services/voice-pipeline.ts"
      via: "voicePipelineService().transcribe and synthesize"
      pattern: "voicePipelineService.*transcribe|synthesize"
    - from: "server/src/services/telegram.ts"
      to: "Telegram Bot API file download"
      via: "ctx.getFile() + fetch download URL"
      pattern: "ctx\\.getFile|api\\.telegram\\.org/file"
---
<objective>
Add voice message handling to the Telegram bridge. Voice notes received are downloaded (OGG), transcribed via VoicePipelineService, and relayed as text. Agent responses can optionally be sent back as OGG voice notes via TTS.
Purpose: Completes the bidirectional voice relay, making the Telegram bridge work for hands-free phone use.
Output: Updated `server/src/services/telegram.ts` with voice handlers
</objective>
<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/phases/38-telegram-bridge/38-CONTEXT.md
@.planning/phases/38-telegram-bridge/38-RESEARCH.md
@.planning/phases/38-telegram-bridge/38-01-SUMMARY.md
@.planning/REQUIREMENTS.md
<interfaces>
<!-- Voice pipeline service from Phase 36 -->
From server/src/services/voice-pipeline.ts:
```typescript
export function voicePipelineService() {
  return {
    async transcodeToWav16k(inputBuffer: Buffer, inputFormat: string): Promise<Buffer>,
    async transcribe(audioBuffer: Buffer, format?: string): Promise<{ text: string; language?: string }>,
    async synthesize(text: string, voiceId?: string): Promise<Buffer>, // returns raw PCM s16le
    formatForVoice(text: string): string, // strips markdown for natural speech
  }
}
```
From grammy (already installed in Plan 01):
```typescript
import { Bot, InputFile } from "grammy";
// ctx.getFile() resolves to a File object: { file_path?: string, ... } (file_path is optional)
// Download URL: https://api.telegram.org/file/bot{TOKEN}/{file_path}
// ctx.replyWithVoice(new InputFile(buffer)) sends OGG voice note
```
ffmpeg transcoding pattern (from voice-pipeline.ts):
```typescript
// spawn(ffmpegBin, ["-f", "s16le", "-ar", "22050", "-ac", "1", "-i", "pipe:0", ...])
// For raw PCM -> OGG Opus: output args = ["-c:a", "libopus", "-ar", "48000", "-f", "ogg", "pipe:1"]
```
</interfaces>
</context>
<tasks>
<task type="auto">
<name>Task 1: Add voice message handler — OGG download, transcription, text relay</name>
<files>server/src/services/telegram.ts</files>
<read_first>
- server/src/services/telegram.ts (the file from Plan 01 — understand handler registration pattern, existing text handler)
- server/src/services/voice-pipeline.ts (transcribe method signature, transcodeToWav16k)
</read_first>
<action>
In `server/src/services/telegram.ts`, add a voice message handler:
1. Import `voicePipelineService` from `"./voice-pipeline.js"` and `InputFile` from `"grammy"` (if not already imported).
2. Add a `bot.on("message:voice", async (ctx) => { ... })` handler:
   - Send an immediate "Transcribing..." reply to prevent a Telegram resend (Pitfall 1).
   - Fire off `processVoiceMessage(ctx, db)` without awaiting it in the handler body. Catch errors on the returned promise and reply with "Voice transcription failed."
3. Implement `processVoiceMessage(ctx, db)`:
   - Download OGG: `const file = await ctx.getFile()`, then construct the URL `https://api.telegram.org/file/bot${token}/${file.file_path}` (get the token from the bot instance). Because `file_path` is optional, bail out if it is missing. Fetch the URL, read `arrayBuffer()`, and convert it to a Buffer.
   - Transcribe: `const voiceSvc = voicePipelineService()`, then `const { text } = await voiceSvc.transcribe(oggBuffer, "ogg")`. The transcribe method handles OGG->WAV16k internally via `transcodeToWav16k`.
   - If the transcription is empty, reply "Could not transcribe voice message." and return.
   - Send a transcription confirmation: `await ctx.reply("Heard: " + text.slice(0, 200))` (truncated for readability).
   - Relay the transcription as text by calling the shared `relayToAgent(ctx, chatId, userText, db)` function from step 4 instead of duplicating the text handler's logic.
4. Refactor: extract the core relay logic from the existing `message:text` handler into `relayToAgent(ctx, chatId, userText, db)` so that both the text and voice handlers call it. This function:
   - Resolves the agent, gets or creates the conversation, persists the user message, collects the LLM stream, prefixes the reply with [AgentName], splits long messages, persists the assistant message, and sends the reply.
5. CRITICAL: Check that the total line count stays under 500 (TGRAM-06). The voice handler plus the refactor should add ~60-80 lines.
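
The download and fire-and-forget steps above could look roughly like the sketch below. It is illustrative only: `VoiceCtx`, `FetchLike`, `buildFileUrl`, `downloadVoice`, and `onVoice` are hypothetical names, and `VoiceCtx` is a deliberately narrow stand-in for the grammY context rather than its real type. The real handler would pass the bot token and `processVoiceMessage` from Plan 01's service.

```typescript
// Hypothetical, narrowed stand-ins for fetch and the grammY context (assumptions).
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  arrayBuffer(): Promise<ArrayBuffer>;
}>;

interface VoiceCtx {
  getFile(): Promise<{ file_path?: string }>;
  reply(text: string): Promise<unknown>;
}

// Builds the Bot API download URL from the token and the file_path returned by getFile().
function buildFileUrl(token: string, filePath: string): string {
  return `https://api.telegram.org/file/bot${token}/${filePath}`;
}

// Downloads the voice note into a Buffer, failing loudly on a missing path or HTTP error.
async function downloadVoice(
  ctx: VoiceCtx,
  token: string,
  fetchImpl: FetchLike = (url) => fetch(url),
): Promise<Buffer> {
  const file = await ctx.getFile();
  if (!file.file_path) throw new Error("Telegram returned no file_path");
  const res = await fetchImpl(buildFileUrl(token, file.file_path));
  if (!res.ok) throw new Error(`Voice download failed: HTTP ${res.status}`);
  return Buffer.from(await res.arrayBuffer());
}

// Acknowledges immediately, then runs the slow work without awaiting it.
async function onVoice(
  ctx: VoiceCtx,
  work: (ctx: VoiceCtx) => Promise<void>,
): Promise<void> {
  await ctx.reply("Transcribing...");
  void work(ctx).catch(async (err: unknown) => {
    console.error("processVoiceMessage failed:", err);
    // Swallow a failing error reply so it cannot become an unhandled rejection.
    await ctx.reply("Voice transcription failed.").catch(() => undefined);
  });
}
```

Keeping the `.catch()` on the un-awaited promise is what lets the handler return quickly while still reporting failures to the user.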
</action>
<verify>
<automated>cd server && pnpm exec tsc --noEmit 2>&1 | head -30</automated>
</verify>
<acceptance_criteria>
- grep -q "message:voice" server/src/services/telegram.ts
- grep -q "voicePipelineService" server/src/services/telegram.ts
- grep -q "transcribe" server/src/services/telegram.ts
- grep -q "ctx.getFile" server/src/services/telegram.ts
- grep -q "Transcribing" server/src/services/telegram.ts
- grep -q "relayToAgent\|relayText" server/src/services/telegram.ts
- wc -l < server/src/services/telegram.ts | awk '{exit ($1 >= 500)}'
</acceptance_criteria>
<done>Voice notes sent to the bot are downloaded, transcribed via voicePipelineService, and relayed to the agent as text. Transcription confirmation shown to user. Shared relay function prevents code duplication.</done>
</task>
<task type="auto">
<name>Task 2: Add TTS voice reply — synthesize agent response to OGG voice note</name>
<files>server/src/services/telegram.ts</files>
<read_first>
- server/src/services/telegram.ts (current state after Task 1)
- server/src/services/voice-pipeline.ts (synthesize method — returns raw PCM s16le buffer)
</read_first>
<action>
In `server/src/services/telegram.ts`, add voice reply capability:
1. Add a `transcodeToOggOpus(rawPcmBuffer: Buffer): Promise<Buffer>` helper function:
   - Use `spawn` from `node:child_process` with `ffmpegPath` from `ffmpeg-static`.
   - Input: raw PCM s16le (Piper output). Use `-f s16le -ar 22050 -ac 1 -i pipe:0`.
   - Output: OGG Opus for Telegram. Use `-c:a libopus -ar 48000 -f ogg pipe:1`.
   - Collect stdout chunks into a Buffer. Reject on a non-zero exit code.
   - Note on sample rate: Piper `en_US-lessac-medium` outputs 22050 Hz. If `voicePipelineService` exposes the model metadata, read the rate from it. Otherwise hardcode 22050 with a comment noting that it must match the Piper model.
2. Modify the `relayToAgent` function (or add a post-relay hook):
   - After sending the text reply, check whether the user's last message was a voice note (pass a `voiceMode` flag or check the context).
   - If voice mode: call `voiceSvc.formatForVoice(fullResponse)` to strip markdown, then `voiceSvc.synthesize(voiceText)` to get raw PCM, then `transcodeToOggOpus(pcmBuffer)` to get OGG, then `ctx.replyWithVoice(new InputFile(oggBuffer, "response.ogg"))`.
   - Wrap the TTS reply in try/catch. If synthesis fails (for example, Piper is not installed), log a warning and skip the voice reply without notifying the user. The text reply has already been sent, so the user still gets the response.
3. Alternative approach (simpler than a hook inside `relayToAgent`): always send the text reply first. Then, if the original message was voice, have the voice handler attempt a voice reply as a bonus. This way a TTS failure never blocks the text response.
4. Keep total telegram.ts under 500 lines. The TTS reply adds ~40-50 lines.
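
A minimal sketch of the transcoding helper and the best-effort voice reply described above, under stated assumptions: the ffmpeg binary path is passed in as a parameter (in the real file it would come from `ffmpeg-static`), `oggOpusArgs`, `tryVoiceReply`, and the `VoiceSvc` interface are hypothetical names, and `sendVoice` stands in for `ctx.replyWithVoice(new InputFile(...))`.

```typescript
import { spawn } from "node:child_process";

// ffmpeg args: raw PCM s16le mono on stdin, OGG Opus at 48 kHz on stdout.
function oggOpusArgs(inputSampleRate: number): string[] {
  return [
    "-f", "s16le", "-ar", String(inputSampleRate), "-ac", "1", "-i", "pipe:0",
    "-c:a", "libopus", "-ar", "48000", "-f", "ogg", "pipe:1",
  ];
}

function transcodeToOggOpus(
  rawPcm: Buffer,
  ffmpegBin: string, // assumption: caller supplies the path, e.g. from ffmpeg-static
  inputSampleRate = 22050, // must match the Piper model's output rate
): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const proc = spawn(ffmpegBin, oggOpusArgs(inputSampleRate), {
      stdio: ["pipe", "pipe", "ignore"],
    });
    const chunks: Buffer[] = [];
    proc.stdout.on("data", (chunk: Buffer) => chunks.push(chunk));
    proc.on("error", reject); // e.g. binary not found
    proc.on("close", (code) => {
      if (code === 0) resolve(Buffer.concat(chunks));
      else reject(new Error(`ffmpeg exited with code ${code}`));
    });
    // Ignore EPIPE if ffmpeg exits early; the close handler reports the failure.
    proc.stdin.on("error", () => undefined);
    proc.stdin.end(rawPcm);
  });
}

// Hypothetical structural type covering the two voicePipelineService() methods used here.
interface VoiceSvc {
  formatForVoice(text: string): string;
  synthesize(text: string): Promise<Buffer>;
}

// Best-effort voice reply: resolves to false instead of throwing when TTS fails.
async function tryVoiceReply(
  svc: VoiceSvc,
  fullResponse: string,
  ffmpegBin: string,
  sendVoice: (ogg: Buffer) => Promise<unknown>,
): Promise<boolean> {
  try {
    const pcm = await svc.synthesize(svc.formatForVoice(fullResponse));
    await sendVoice(await transcodeToOggOpus(pcm, ffmpegBin));
    return true;
  } catch (err) {
    console.warn("TTS voice reply skipped:", err);
    return false;
  }
}
```

Because `tryVoiceReply` never throws, it can be called after the text reply has been sent without risking the text path.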
</action>
<verify>
<automated>cd server && pnpm exec tsc --noEmit 2>&1 | head -30</automated>
</verify>
<acceptance_criteria>
- grep -q "transcodeToOggOpus\|ogg.*opus\|libopus" server/src/services/telegram.ts
- grep -q "synthesize" server/src/services/telegram.ts
- grep -q "replyWithVoice\|reply_with_voice" server/src/services/telegram.ts
- grep -q "InputFile" server/src/services/telegram.ts
- grep -q "formatForVoice" server/src/services/telegram.ts
- wc -l < server/src/services/telegram.ts | awk '{exit ($1 >= 500)}'
</acceptance_criteria>
<done>Agent responses to voice messages include both a text reply and an OGG voice note. TTS failure degrades gracefully to text-only. telegram.ts remains under 500 lines.</done>
</task>
</tasks>
<verification>
- `cd server && pnpm exec tsc --noEmit`: zero errors
- `wc -l server/src/services/telegram.ts`: under 500 lines
- `grep -c "message:voice\|transcribe\|synthesize\|replyWithVoice\|InputFile" server/src/services/telegram.ts`: at least 5 matching lines (`grep -c` counts lines, not individual matches)
</verification>
<success_criteria>
- Voice notes are downloaded from Telegram, transcribed, and relayed to agent as text
- Agent responses to voice messages are also sent back as OGG voice notes generated via TTS
- TTS failure degrades gracefully (text reply still works)
- telegram.ts remains under 500 lines total
- TypeScript compiles without errors
</success_criteria>
<output>
After completion, create `.planning/phases/38-telegram-bridge/38-02-SUMMARY.md`
</output>