nexus/.planning/phases/38-telegram-bridge/38-02-PLAN.md at c3689f11b1fffc55907dca670f69b0e486b42e47

mikkel/nexus

Nexus Dev 9959d1b77e docs(38): create 3 plans in 2 waves for Telegram bridge

2026-04-04 03:55:50 +00:00

field	value
phase	38-telegram-bridge
plan	
type	execute
wave	
depends_on	38-01
files_modified	server/src/services/telegram.ts
autonomous	true
requirements	TGRAM-03, TGRAM-04

must_haves

truths
- A voice note sent to the Telegram bot is transcribed and produces an agent text reply
- The bot can send back an OGG voice note generated from TTS

artifacts
path	provides	contains
server/src/services/telegram.ts	Voice message handler (OGG download, transcribe, relay) and TTS reply (synthesize, WAV->OGG, sendVoice)	message:voice

key_links
from	to	via	pattern
server/src/services/telegram.ts	server/src/services/voice-pipeline.ts	voicePipelineService().transcribe and synthesize	voicePipelineService.*transcribe\|synthesize
server/src/services/telegram.ts	Telegram Bot API file download	ctx.getFile() + fetch download URL	ctx.getFile\|api.telegram.org/file

Add voice message handling to the Telegram bridge. Voice notes received are downloaded (OGG), transcribed via VoicePipelineService, and relayed as text. Agent responses can optionally be sent back as OGG voice notes via TTS.

Purpose: Completes the bidirectional voice relay, making the Telegram bridge work for hands-free phone use.

Output: Updated server/src/services/telegram.ts with voice handlers

<execution_context> @$HOME/.claude/get-shit-done/workflows/execute-plan.md @$HOME/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/phases/38-telegram-bridge/38-CONTEXT.md @.planning/phases/38-telegram-bridge/38-RESEARCH.md @.planning/phases/38-telegram-bridge/38-01-SUMMARY.md @.planning/REQUIREMENTS.md

From server/src/services/voice-pipeline.ts:

export declare function voicePipelineService(): {
  transcodeToWav16k(inputBuffer: Buffer, inputFormat: string): Promise<Buffer>;
  transcribe(audioBuffer: Buffer, format?: string): Promise<{ text: string; language?: string }>;
  synthesize(text: string, voiceId?: string): Promise<Buffer>; // returns raw PCM s16le
  formatForVoice(text: string): string; // strips markdown for natural speech
};

From grammy (already installed in Plan 01):

import { Bot, InputFile } from "grammy";
// ctx.getFile() returns { file_path: string }
// Download URL: https://api.telegram.org/file/bot{TOKEN}/{file_path}
// ctx.replyWithVoice(new InputFile(buffer)) sends OGG voice note

ffmpeg transcoding pattern (from voice-pipeline.ts):

// spawn(ffmpegBin, ["-f", "s16le", "-ar", "22050", "-ac", "1", "-i", "pipe:0", ...])
// For raw PCM -> OGG Opus: output args = ["-c:a", "libopus", "-ar", "48000", "-f", "ogg", "pipe:1"]

Task 1: Add voice message handler (OGG download, transcription, text relay)

Files: server/src/services/telegram.ts

Read first:
- server/src/services/telegram.ts (the file from Plan 01; understand the handler registration pattern and the existing text handler)
- server/src/services/voice-pipeline.ts (transcribe method signature, transcodeToWav16k)

In `server/src/services/telegram.ts`, add a voice message handler:

1. Import voicePipelineService from "./voice-pipeline.js" and InputFile from "grammy" (if not already imported).
2. Add a bot.on("message:voice", async (ctx) => { ... }) handler:
   - Send an immediate "Transcribing..." reply to prevent a Telegram resend (Pitfall 1).
   - Fire off async processVoiceMessage(ctx, db); do NOT await it in the handler body. Catch errors and reply with "Voice transcription failed."
3. Implement processVoiceMessage(ctx, db):
   - Download OGG: const file = await ctx.getFile(), then construct the URL https://api.telegram.org/file/bot${token}/${file.file_path}, taking the token from the bot instance. Fetch the URL, call arrayBuffer(), and convert the result to a Buffer.
   - Transcribe: const voiceSvc = voicePipelineService(), then const { text } = await voiceSvc.transcribe(oggBuffer, "ogg"). The transcribe method handles OGG->WAV16k internally via transcodeToWav16k.
   - If the transcription is empty, reply "Could not transcribe voice message." and return.
   - Send a transcription confirmation: await ctx.reply("Heard: " + text.slice(0, 200)) (truncated for readability).
   - Then relay the transcription as text by calling the shared relayToAgent(ctx, chatId, userText, db) function, so the voice path reuses the SAME relay logic as the message:text handler.
4. Refactor: Extract the core relay logic from the existing message:text handler into relayToAgent(ctx, chatId, userText, db) so both the text and voice handlers can use it. This function:
   - Resolves the agent, gets or creates the conversation, persists the user message, collects the LLM stream, prefixes the response with [AgentName], splits long messages, persists the assistant message, and sends the reply.
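
The voice flow in steps 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in telegram.ts: the `VoiceCtx` and `VoiceDeps` shapes, the injected `download` and `relay` functions, and the `handleVoice` name are hypothetical, introduced so the flow can run without grammy or network access. In the real handler, `download` would fetch the file URL and `transcribe` would be `voicePipelineService().transcribe`.

```typescript
// Hypothetical minimal shapes; the real handler receives a grammy Context.
interface VoiceCtx {
  chatId: number;
  reply(text: string): Promise<unknown>;
  getFile(): Promise<{ file_path?: string }>;
}

interface VoiceDeps {
  token: string;
  // Assumed to fetch the URL and return the raw OGG bytes.
  download(url: string): Promise<Uint8Array>;
  // Stands in for voicePipelineService().transcribe(buffer, "ogg").
  transcribe(audio: Uint8Array, format: string): Promise<{ text: string }>;
  // Stands in for the shared relayToAgent function.
  relay(ctx: VoiceCtx, chatId: number, userText: string): Promise<void>;
}

// Telegram serves bot files from this URL pattern.
function buildFileUrl(token: string, filePath: string): string {
  return `https://api.telegram.org/file/bot${token}/${filePath}`;
}

async function processVoiceMessage(ctx: VoiceCtx, deps: VoiceDeps): Promise<void> {
  const file = await ctx.getFile();
  if (!file.file_path) throw new Error("Telegram returned no file_path");
  const ogg = await deps.download(buildFileUrl(deps.token, file.file_path));
  const { text } = await deps.transcribe(ogg, "ogg");
  if (!text.trim()) {
    await ctx.reply("Could not transcribe voice message.");
    return;
  }
  await ctx.reply("Heard: " + text.slice(0, 200));
  await deps.relay(ctx, ctx.chatId, text);
}

// Handler body: acknowledge first, then start the slow work without awaiting it.
async function handleVoice(ctx: VoiceCtx, deps: VoiceDeps): Promise<void> {
  await ctx.reply("Transcribing...");
  void processVoiceMessage(ctx, deps).catch(async (err) => {
    console.warn("voice processing failed:", err);
    await ctx.reply("Voice transcription failed.").catch(() => undefined);
  });
}
```

Passing the collaborators in as `deps` is only a convenience for the sketch; it keeps the handler testable with stubs and does not imply that the plan requires dependency injection.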
CRITICAL: Check that the total line count stays under 500 (TGRAM-06). The voice handler plus the refactor should add ~60-80 lines.

Verify: cd server && pnpm exec tsc --noEmit 2>&1 | head -30

<acceptance_criteria>
- grep -q "message:voice" server/src/services/telegram.ts
- grep -q "voicePipelineService" server/src/services/telegram.ts
- grep -q "transcribe" server/src/services/telegram.ts
- grep -q "ctx.getFile" server/src/services/telegram.ts
- grep -q "Transcribing" server/src/services/telegram.ts
- grep -Eq "relayToAgent|relayText" server/src/services/telegram.ts
- wc -l < server/src/services/telegram.ts | awk '{exit ($1 > 500)}'
</acceptance_criteria>

Done: Voice notes sent to the bot are downloaded, transcribed via voicePipelineService, and relayed to the agent as text. A transcription confirmation is shown to the user. The shared relay function prevents code duplication.
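
The part of relayToAgent that prefixes the response with [AgentName] and splits long messages can be illustrated with two small pure helpers. This is a sketch only: `splitMessage` and `formatAgentReply` are hypothetical names, the newline-preferring split is one possible strategy, and the Plan 01 implementation may split differently. The 4096-character limit is the Telegram Bot API maximum for a text message.

```typescript
// Telegram rejects text messages longer than 4096 characters.
const TELEGRAM_MAX_LEN = 4096;

// Splits text into chunks of at most `limit` characters (limit must be > 0).
function splitMessage(text: string, limit: number = TELEGRAM_MAX_LEN): string[] {
  const chunks: string[] = [];
  let rest = text;
  while (rest.length > limit) {
    // Prefer the last newline inside the window; otherwise cut hard at the limit.
    const cut = rest.lastIndexOf("\n", limit);
    const end = cut > 0 ? cut : limit;
    chunks.push(rest.slice(0, end));
    // Drop the newline that the split happened on, if any.
    rest = rest.slice(end).replace(/^\n/, "");
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}

// Prefixes the agent name, then splits the result into sendable chunks.
function formatAgentReply(agentName: string, response: string): string[] {
  return splitMessage(`[${agentName}] ${response}`);
}
```

Each returned chunk would then be sent with its own ctx.reply call, in order.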

Task 2: Add TTS voice reply (synthesize agent response to OGG voice note)

Files: server/src/services/telegram.ts

Read first:
- server/src/services/telegram.ts (current state after Task 1)
- server/src/services/voice-pipeline.ts (synthesize method; returns a raw PCM s16le buffer)

In `server/src/services/telegram.ts`, add voice reply capability:

1. Add a transcodeToOggOpus(rawPcmBuffer: Buffer): Promise<Buffer> helper function:
   - Use spawn from node:child_process with ffmpegPath from ffmpeg-static.
   - Input: raw PCM s16le (Piper output). Use -f s16le -ar 22050 -ac 1 -i pipe:0
   - Output: OGG Opus for Telegram. Use -c:a libopus -ar 48000 -f ogg pipe:1
   - Collect stdout chunks into a Buffer. Reject on a non-zero exit code.
   - Note on sample rate: Piper en_US-lessac-medium outputs 22050 Hz. If the model metadata is available from voicePipelineService, read the rate from it. Otherwise hardcode 22050 with a comment noting that it must match the Piper model.
2. Modify the relayToAgent function (or add a post-relay hook):
   - After sending the text reply, check whether the user's last message was a voice note (pass a voiceMode flag or check context).
   - If voice mode: call voiceSvc.formatForVoice(fullResponse) to strip markdown, then voiceSvc.synthesize(voiceText) to get raw PCM, then transcodeToOggOpus(pcmBuffer) to get OGG, then ctx.replyWithVoice(new InputFile(oggBuffer, "response.ogg")).
   - Wrap the TTS reply in try/catch. If synthesis fails (for example, Piper is not installed), log a warning and skip the voice reply silently. The text reply has already been sent, so the user still gets the response.
3. Alternative approach (simpler): Always send the text reply first. Then, if the original message was voice, attempt a voice reply as a bonus. This way a TTS failure never blocks the text response.
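
The transcodeToOggOpus helper described above might look roughly like this. It is a sketch, assuming Node.js with an ffmpeg build that includes libopus. The `oggOpusArgs` helper and the `ffmpegBin` and `inputSampleRate` parameters are hypothetical additions for illustration; the plan itself takes the binary path from ffmpeg-static and a single buffer argument.

```typescript
import { spawn } from "node:child_process";

// Builds the ffmpeg argument list: raw PCM s16le mono in, OGG Opus out.
function oggOpusArgs(inputSampleRate: number): string[] {
  return [
    "-f", "s16le", "-ar", String(inputSampleRate), "-ac", "1", "-i", "pipe:0",
    "-c:a", "libopus", "-ar", "48000", "-f", "ogg", "pipe:1",
  ];
}

function transcodeToOggOpus(
  rawPcmBuffer: Buffer,
  ffmpegBin: string = "ffmpeg", // assumption: the plan passes the ffmpeg-static path here
  inputSampleRate: number = 22050, // must match the Piper model's output rate
): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const proc = spawn(ffmpegBin, oggOpusArgs(inputSampleRate), {
      stdio: ["pipe", "pipe", "ignore"],
    });
    const chunks: Buffer[] = [];
    proc.stdout.on("data", (chunk: Buffer) => chunks.push(chunk));
    proc.on("error", reject); // for example, the ffmpeg binary was not found
    proc.on("close", (code) => {
      if (code === 0) resolve(Buffer.concat(chunks));
      else reject(new Error(`ffmpeg exited with code ${code}`));
    });
    proc.stdin.on("error", reject); // EPIPE if ffmpeg exits before reading all input
    proc.stdin.end(rawPcmBuffer); // write the PCM data and close stdin
  });
}
```

Keeping the argument list in its own function makes the sample-rate dependency explicit and easy to check without running ffmpeg.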
Keep the total size of telegram.ts under 500 lines. The TTS reply adds ~40-50 lines.

Verify: cd server && pnpm exec tsc --noEmit 2>&1 | head -30

<acceptance_criteria>
- grep -Eq "transcodeToOggOpus|ogg.*opus|libopus" server/src/services/telegram.ts
- grep -q "synthesize" server/src/services/telegram.ts
- grep -Eq "replyWithVoice|reply_with_voice" server/src/services/telegram.ts
- grep -q "InputFile" server/src/services/telegram.ts
- grep -q "formatForVoice" server/src/services/telegram.ts
- wc -l < server/src/services/telegram.ts | awk '{exit ($1 > 500)}'
</acceptance_criteria>

Done: Agent responses to voice messages include both a text reply and an OGG voice note. TTS failure degrades gracefully to text-only. telegram.ts remains under 500 lines.
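
The graceful degradation required here (text first, voice as a best-effort extra) can be sketched as below. The `ReplyCtx` and `TtsDeps` shapes and the `sendReply` name are hypothetical; in telegram.ts the voice call would be ctx.replyWithVoice(new InputFile(oggBuffer, "response.ogg")) and the TTS functions would come from voicePipelineService and the transcodeToOggOpus helper.

```typescript
// Hypothetical minimal shapes so the sketch runs without grammy or Piper.
interface ReplyCtx {
  reply(text: string): Promise<unknown>;
  // Stands in for ctx.replyWithVoice(new InputFile(ogg, "response.ogg")).
  replyWithVoice(ogg: Uint8Array): Promise<unknown>;
}

interface TtsDeps {
  formatForVoice(text: string): string;
  synthesize(text: string): Promise<Uint8Array>; // raw PCM s16le
  transcodeToOggOpus(pcm: Uint8Array): Promise<Uint8Array>;
}

// Returns true when a voice note was sent, false when TTS was skipped or failed.
async function sendReply(
  ctx: ReplyCtx,
  fullResponse: string,
  voiceMode: boolean,
  tts: TtsDeps,
): Promise<boolean> {
  await ctx.reply(fullResponse); // the text reply always goes out first
  if (!voiceMode) return false;
  try {
    const voiceText = tts.formatForVoice(fullResponse);
    const pcm = await tts.synthesize(voiceText);
    const ogg = await tts.transcodeToOggOpus(pcm);
    await ctx.replyWithVoice(ogg);
    return true;
  } catch (err) {
    // TTS is a bonus: log the failure and keep the text-only result.
    console.warn("TTS reply skipped:", err);
    return false;
  }
}
```

Because the text reply is sent before the try block, a missing Piper install or an ffmpeg failure cannot prevent the user from receiving the response.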

- `cd server && pnpm exec tsc --noEmit`: zero errors
- `wc -l server/src/services/telegram.ts`: under 500 lines
- `grep -c "message:voice\|transcribe\|synthesize\|replyWithVoice\|InputFile" server/src/services/telegram.ts`: at least 5 matches

<success_criteria>

- Voice notes are downloaded from Telegram, transcribed, and relayed to the agent as text
- Agent responses generate OGG voice notes sent back via Telegram
- TTS failure degrades gracefully (text reply still works)
- telegram.ts remains under 500 lines total
- TypeScript compiles without errors

</success_criteria>

After completion, create `.planning/phases/38-telegram-bridge/38-02-SUMMARY.md`
