nexus/.planning/phases/38-telegram-bridge/38-02-PLAN.md

---
phase: 38-telegram-bridge
plan: 02
type: execute
wave: 2
depends_on:
  - 38-01
files_modified:
  - server/src/services/telegram.ts
autonomous: true
requirements:
  - TGRAM-03
  - TGRAM-04
must_haves:
  truths:
    - "A voice note sent to the Telegram bot is transcribed and produces an agent text reply"
    - "The bot can send back an OGG voice note generated from TTS"
  artifacts:
    - path: server/src/services/telegram.ts
      provides: "Voice message handler (OGG download, transcribe, relay) and TTS reply (synthesize, WAV->OGG, sendVoice)"
      contains: "message:voice"
  key_links:
    - from: server/src/services/telegram.ts
      to: server/src/services/voice-pipeline.ts
      via: "voicePipelineService().transcribe and synthesize"
      pattern: "voicePipelineService.*transcribe|synthesize"
    - from: server/src/services/telegram.ts
      to: Telegram Bot API file download
      via: "ctx.getFile() + fetch download URL"
      pattern: "ctx.getFile|api.telegram.org/file"
---

Add voice message handling to the Telegram bridge. Voice notes received are downloaded (OGG), transcribed via VoicePipelineService, and relayed as text. Agent responses can optionally be sent back as OGG voice notes via TTS.

Purpose: Completes the bidirectional voice relay, making the Telegram bridge work for hands-free phone use.

Output: Updated server/src/services/telegram.ts with voice handlers.

<execution_context>
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
@$HOME/.claude/get-shit-done/templates/summary.md
</execution_context>

@.planning/phases/38-telegram-bridge/38-CONTEXT.md
@.planning/phases/38-telegram-bridge/38-RESEARCH.md
@.planning/phases/38-telegram-bridge/38-01-SUMMARY.md
@.planning/REQUIREMENTS.md

From server/src/services/voice-pipeline.ts:

export function voicePipelineService() {
  return {
    async transcodeToWav16k(inputBuffer: Buffer, inputFormat: string): Promise<Buffer>,
    async transcribe(audioBuffer: Buffer, format?: string): Promise<{ text: string; language?: string }>,
    async synthesize(text: string, voiceId?: string): Promise<Buffer>,  // returns raw PCM s16le
    formatForVoice(text: string): string,  // strips markdown for natural speech
  }
}

From grammy (already installed in Plan 01):

import { Bot, InputFile } from "grammy";
// ctx.getFile() returns { file_path: string }
// Download URL: https://api.telegram.org/file/bot{TOKEN}/{file_path}
// ctx.replyWithVoice(new InputFile(buffer)) sends OGG voice note
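
The download step in these notes can be sketched as below. This is a minimal sketch, assuming Node 18+ (global `fetch`). `buildFileUrl` and `downloadTelegramFile` are hypothetical helper names; only the URL shape comes from the notes above.

```typescript
// Hypothetical helpers; only the URL shape is taken from the notes above.
function buildFileUrl(token: string, filePath: string): string {
  return `https://api.telegram.org/file/bot${token}/${filePath}`;
}

// Downloads a Telegram file into a Buffer. Assumes Node 18+ for global fetch.
async function downloadTelegramFile(
  token: string,
  filePath: string | undefined,
): Promise<Buffer> {
  // file_path is optional in the Bot API File object, so guard before use.
  if (!filePath) throw new Error("Telegram returned no file_path");
  const res = await fetch(buildFileUrl(token, filePath));
  if (!res.ok) {
    throw new Error(`Telegram file download failed: HTTP ${res.status}`);
  }
  return Buffer.from(await res.arrayBuffer());
}
```

In the voice handler this would presumably be called with the bot token and the `file_path` returned by `ctx.getFile()`.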

ffmpeg transcoding pattern (from voice-pipeline.ts):

// spawn(ffmpegBin, ["-f", "s16le", "-ar", "22050", "-ac", "1", "-i", "pipe:0", ...])
// For raw PCM->OGG: output args = ["-c:a", "libopus", "-ar", "48000", "-f", "ogg", "pipe:1"]
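
The transcoding pattern above can be sketched as a small helper that pipes raw PCM through ffmpeg and collects OGG Opus from stdout. This is a sketch under assumptions: `oggOpusArgs` is a hypothetical name, the binary defaults to an `ffmpeg` on `PATH` (the plan itself takes the path from ffmpeg-static), and 22050 Hz must match the Piper model in use.

```typescript
import { spawn } from "node:child_process";

// Assumption: 22050 Hz matches the Piper model (en_US-lessac-medium per this plan).
const PIPER_SAMPLE_RATE = 22050;

// Builds the ffmpeg argument list: raw s16le mono PCM in, OGG Opus out.
function oggOpusArgs(sampleRate: number = PIPER_SAMPLE_RATE): string[] {
  return [
    "-f", "s16le", "-ar", String(sampleRate), "-ac", "1", "-i", "pipe:0",
    "-c:a", "libopus", "-ar", "48000", "-f", "ogg", "pipe:1",
  ];
}

// ffmpegBin is an assumption; pass the ffmpeg-static path in real code.
function transcodeToOggOpus(rawPcm: Buffer, ffmpegBin = "ffmpeg"): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const proc = spawn(ffmpegBin, oggOpusArgs());
    const out: Buffer[] = [];
    const err: Buffer[] = [];
    proc.stdout.on("data", (chunk: Buffer) => out.push(chunk));
    proc.stderr.on("data", (chunk: Buffer) => err.push(chunk));
    proc.on("error", reject); // e.g. binary not found
    proc.on("close", (code) => {
      if (code === 0) resolve(Buffer.concat(out));
      else reject(new Error(`ffmpeg exited with code ${code}: ${Buffer.concat(err).toString()}`));
    });
    // Ignore EPIPE if ffmpeg exits early; the close handler reports the failure.
    proc.stdin.on("error", () => {});
    proc.stdin.end(rawPcm);
  });
}
```

Keeping the argument list in its own function makes the sample-rate assumption visible in one place and easy to check without running ffmpeg.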
Task 1: Add voice message handler (OGG download, transcription, text relay)

Files: server/src/services/telegram.ts

Read first:
- server/src/services/telegram.ts (the file from Plan 01; understand the handler registration pattern and the existing text handler)
- server/src/services/voice-pipeline.ts (transcribe method signature, transcodeToWav16k)

In `server/src/services/telegram.ts`, add a voice message handler:
  1. Import voicePipelineService from "./voice-pipeline.js" and InputFile from "grammy" (if not already)

  2. Add bot.on("message:voice", async (ctx) => { ... }) handler:

    • Send immediate "Transcribing..." reply to prevent Telegram resend (Pitfall 1)
    • Fire off async processVoiceMessage(ctx, db) — do NOT await in handler body. Catch errors and reply with "Voice transcription failed."
  3. Implement processVoiceMessage(ctx, db):

    • Download OGG: const file = await ctx.getFile() then construct URL https://api.telegram.org/file/bot${token}/${file.file_path} — get token from bot instance. Fetch the URL, get arrayBuffer(), convert to Buffer.
    • Transcribe: const voiceSvc = voicePipelineService(), then const { text } = await voiceSvc.transcribe(oggBuffer, "ogg"). The transcribe method handles OGG->WAV16k internally via transcodeToWav16k.
    • If transcription is empty, reply "Could not transcribe voice message." and return.
    • Send transcription confirmation: await ctx.reply("Heard: " + text.slice(0, 200)) (truncate for readability)
    • Then relay as text by calling the shared relayToAgent(ctx, chatId, userText, db) function (step 4), so the voice path uses exactly the same relay logic as the message:text handler.
  4. Refactor: Extract the core relay logic from the existing message:text handler into relayToAgent(ctx, chatId, userText, db) so both text and voice handlers can use it. This function:

    • Resolves agent, gets/creates conversation, persists user message, collects LLM stream, prefixes with [AgentName], splits long messages, persists assistant message, sends reply.
  5. CRITICAL: Check total line count stays under 500 (TGRAM-06). The voice handler + refactor should add ~60-80 lines.

Verify: cd server && pnpm exec tsc --noEmit 2>&1 | head -30

<acceptance_criteria>

    • grep -q "message:voice" server/src/services/telegram.ts
    • grep -q "voicePipelineService" server/src/services/telegram.ts
    • grep -q "transcribe" server/src/services/telegram.ts
    • grep -q "ctx.getFile" server/src/services/telegram.ts
    • grep -q "Transcribing" server/src/services/telegram.ts
    • grep -qE "relayToAgent|relayText" server/src/services/telegram.ts
    • wc -l < server/src/services/telegram.ts | awk '{exit ($1 > 500)}'
</acceptance_criteria>

Done: Voice notes sent to the bot are downloaded, transcribed via voicePipelineService, and relayed to the agent as text. Transcription confirmation shown to user. Shared relay function prevents code duplication.
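
The handler flow from Task 1 (acknowledge immediately, process without awaiting, guard against an empty transcription) might look roughly like the sketch below. The `VoiceCtx`, `Transcribe`, and `Relay` types are hypothetical stand-ins for the grammy context, `voicePipelineService().transcribe`, and `relayToAgent`. `onVoice` and `confirmationText` are illustrative names, not part of the plan.

```typescript
// Minimal structural stand-ins; assumptions for the sketch, not the real types.
interface VoiceCtx {
  reply(text: string): Promise<unknown>;
}
type Transcribe = (audio: Buffer, format?: string) => Promise<{ text: string }>;
type Relay = (ctx: VoiceCtx, userText: string) => Promise<void>;

// Truncate the echo so a long transcription does not flood the chat.
function confirmationText(text: string): string {
  return "Heard: " + text.slice(0, 200);
}

async function processVoiceMessage(
  ctx: VoiceCtx,
  ogg: Buffer,
  transcribe: Transcribe,
  relay: Relay,
): Promise<void> {
  const { text } = await transcribe(ogg, "ogg");
  if (!text.trim()) {
    await ctx.reply("Could not transcribe voice message.");
    return;
  }
  await ctx.reply(confirmationText(text));
  await relay(ctx, text); // same relay path as the text handler
}

// Handler body: acknowledge first, then run the slow work without awaiting it.
async function onVoice(
  ctx: VoiceCtx,
  download: () => Promise<Buffer>,
  transcribe: Transcribe,
  relay: Relay,
): Promise<void> {
  await ctx.reply("Transcribing...");
  void (async () => {
    const ogg = await download();
    await processVoiceMessage(ctx, ogg, transcribe, relay);
  })().catch(async (err: unknown) => {
    console.error("voice processing failed", err);
    await ctx.reply("Voice transcription failed.").catch(() => undefined);
  });
}
```

Passing `download`, `transcribe`, and `relay` as parameters is a choice made for the sketch so it runs without grammy or the voice pipeline; the real handler would close over the bot and `db` instead.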
Task 2: Add TTS voice reply (synthesize agent response to OGG voice note)

Files: server/src/services/telegram.ts

Read first:
- server/src/services/telegram.ts (current state after Task 1)
- server/src/services/voice-pipeline.ts (synthesize method; returns raw PCM s16le buffer)

In `server/src/services/telegram.ts`, add voice reply capability:
  1. Add a transcodeToOggOpus(rawPcmBuffer: Buffer): Promise<Buffer> helper function:

    • Use spawn from node:child_process with ffmpegPath from ffmpeg-static
    • Input: raw PCM s16le (Piper output). Use -f s16le -ar 22050 -ac 1 -i pipe:0
    • Output: OGG Opus for Telegram. Use -c:a libopus -ar 48000 -f ogg pipe:1
    • Collect stdout chunks into Buffer. Reject on non-zero exit code.
    • Note on sample rate: Piper en_US-lessac-medium outputs 22050 Hz. If voicePipelineService exposes the model metadata, read the rate from there. Otherwise hardcode 22050 with a comment noting it must match the Piper model.
  2. Modify the relayToAgent function (or add a post-relay hook):

    • After sending the text reply, check if the user's last message was a voice note (pass a voiceMode flag or check context)
    • If voice mode: call voiceSvc.formatForVoice(fullResponse) to strip markdown, then voiceSvc.synthesize(voiceText) to get raw PCM, then transcodeToOggOpus(pcmBuffer) to get OGG, then ctx.replyWithVoice(new InputFile(oggBuffer, "response.ogg")).
    • Wrap TTS reply in try/catch — if synthesis fails (Piper not installed), log warning and skip voice reply silently. Text reply already sent, so user still gets the response.
  3. Alternative approach (simpler): Always send text reply first. Then if the original message was voice, attempt a voice reply as a bonus. This way failure of TTS never blocks the text response.

  4. Keep total telegram.ts under 500 lines. The TTS reply adds ~40-50 lines.

Verify: cd server && pnpm exec tsc --noEmit 2>&1 | head -30

<acceptance_criteria>

    • grep -qE "transcodeToOggOpus|ogg.*opus|libopus" server/src/services/telegram.ts
    • grep -q "synthesize" server/src/services/telegram.ts
    • grep -qE "replyWithVoice|reply_with_voice" server/src/services/telegram.ts
    • grep -q "InputFile" server/src/services/telegram.ts
    • grep -q "formatForVoice" server/src/services/telegram.ts
    • wc -l < server/src/services/telegram.ts | awk '{exit ($1 > 500)}'
</acceptance_criteria>

Done: Agent responses to voice messages include both a text reply and an OGG voice note. TTS failure degrades gracefully to text-only. telegram.ts remains under 500 lines.
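
The "text first, voice as a bonus" behaviour from Task 2 could be sketched as a post-relay step like the one below. `VoiceReplyDeps` and `sendVoiceReply` are hypothetical names. The dependency shapes mirror the signatures quoted in this plan but are assumptions, injected so the sketch runs without Piper, ffmpeg, or grammy.

```typescript
// Injected dependencies; assumptions that mirror the signatures in this plan.
interface VoiceReplyDeps {
  formatForVoice(text: string): string;             // strips markdown
  synthesize(text: string): Promise<Buffer>;        // raw PCM s16le
  transcodeToOggOpus(pcm: Buffer): Promise<Buffer>; // OGG Opus
  sendVoice(ogg: Buffer): Promise<unknown>;         // e.g. wraps ctx.replyWithVoice
}

// Call this only after the text reply has been sent.
// Returns true if a voice note went out, false if TTS was skipped or failed.
async function sendVoiceReply(
  fullResponse: string,
  voiceMode: boolean,
  deps: VoiceReplyDeps,
): Promise<boolean> {
  if (!voiceMode) return false;
  try {
    const voiceText = deps.formatForVoice(fullResponse);
    const pcm = await deps.synthesize(voiceText);
    const ogg = await deps.transcodeToOggOpus(pcm);
    await deps.sendVoice(ogg);
    return true;
  } catch (err) {
    // TTS is best effort: log and fall back to the text reply already sent.
    console.warn("TTS voice reply skipped:", err);
    return false;
  }
}
```

Because the whole chain sits inside one try/catch and runs after the text reply, a missing Piper install or an ffmpeg failure cannot block or delay the text response.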
- `cd server && pnpm exec tsc --noEmit`: zero errors
- `wc -l server/src/services/telegram.ts`: under 500 lines
- `grep -cE "message:voice|transcribe|synthesize|replyWithVoice|InputFile" server/src/services/telegram.ts`: at least 5 matching lines

<success_criteria>

  • Voice notes are downloaded from Telegram, transcribed, and relayed to agent as text
  • Agent responses generate OGG voice notes sent back via Telegram
  • TTS failure degrades gracefully (text reply still works)
  • telegram.ts remains under 500 lines total
  • TypeScript compiles without errors </success_criteria>
After completion, create `.planning/phases/38-telegram-bridge/38-02-SUMMARY.md`