| phase | plan | type | wave | depends_on | files_modified | autonomous | requirements | must_haves |
|---|---|---|---|---|---|---|---|---|
| 38-telegram-bridge | 02 | execute | 2 | | | true | | |

Purpose: Completes the bidirectional voice relay, making the Telegram bridge work for hands-free phone use.
Output: Updated server/src/services/telegram.ts with voice handlers
<execution_context> @$HOME/.claude/get-shit-done/workflows/execute-plan.md @$HOME/.claude/get-shit-done/templates/summary.md </execution_context>
@.planning/phases/38-telegram-bridge/38-CONTEXT.md @.planning/phases/38-telegram-bridge/38-RESEARCH.md @.planning/phases/38-telegram-bridge/38-01-SUMMARY.md @.planning/REQUIREMENTS.md

From server/src/services/voice-pipeline.ts:
export function voicePipelineService() {
  return {
    async transcodeToWav16k(inputBuffer: Buffer, inputFormat: string): Promise<Buffer>,
    async transcribe(audioBuffer: Buffer, format?: string): Promise<{ text: string; language?: string }>,
    async synthesize(text: string, voiceId?: string): Promise<Buffer>, // returns raw PCM s16le
    formatForVoice(text: string): string, // strips markdown for natural speech
  }
}
From grammy (already installed in Plan 01):
import { Bot, InputFile } from "grammy";
// ctx.getFile() returns { file_path: string }
// Download URL: https://api.telegram.org/file/bot{TOKEN}/{file_path}
// ctx.replyWithVoice(new InputFile(buffer)) sends OGG voice note
ffmpeg transcoding pattern (from voice-pipeline.ts):
// spawn(ffmpegBin, ["-f", "s16le", "-ar", "22050", "-ac", "1", "-i", "pipe:0", ...])
// For WAV->OGG: output args = ["-c:a", "libopus", "-ar", "48000", "-f", "ogg", "pipe:1"]
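The two conventions noted above (the Bot API file download URL and the ffmpeg flags for raw PCM to OGG Opus) can be captured as small pure helpers. This is a minimal sketch: the function names `telegramFileUrl` and `pcmToOggOpusArgs` are hypothetical, and the 22050 Hz default is an assumption that must match the Piper model in use.

```typescript
// Hypothetical helper names; only the URL shape and the ffmpeg flags are taken
// from the interface notes above.

/** Builds the Bot API download URL for a file_path returned by getFile. */
function telegramFileUrl(token: string, filePath: string): string {
  return `https://api.telegram.org/file/bot${token}/${filePath}`;
}

/** ffmpeg arguments: raw mono s16le PCM on stdin, OGG Opus on stdout. */
function pcmToOggOpusArgs(inputSampleRate: number = 22050): string[] {
  return [
    // Input side: headerless PCM, so format, rate, and channels are explicit.
    "-f", "s16le", "-ar", String(inputSampleRate), "-ac", "1", "-i", "pipe:0",
    // Output side: Opus in an OGG container, resampled to 48 kHz.
    "-c:a", "libopus", "-ar", "48000", "-f", "ogg", "pipe:1",
  ];
}
```

Keeping the argument list in one function makes the sample-rate assumption visible in a single place instead of repeating it at each spawn call.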
Task 1: Add voice message handler — OGG download, transcription, text relay
server/src/services/telegram.ts
- server/src/services/telegram.ts (the file from Plan 01 — understand handler registration pattern, existing text handler)
- server/src/services/voice-pipeline.ts (transcribe method signature, transcodeToWav16k)
In `server/src/services/telegram.ts`, add a voice message handler:

- Import `voicePipelineService` from `"./voice-pipeline.js"` and `InputFile` from `"grammy"` (if not already imported).
- Add a `bot.on("message:voice", async (ctx) => { ... })` handler:
  - Send an immediate "Transcribing..." reply to prevent Telegram resend (Pitfall 1).
  - Fire off async `processVoiceMessage(ctx, db)`; do NOT await it in the handler body. Catch errors and reply with "Voice transcription failed."
- Implement `processVoiceMessage(ctx, db)`:
  - Download OGG: `const file = await ctx.getFile()`, then construct the URL `https://api.telegram.org/file/bot${token}/${file.file_path}` (get the token from the bot instance). Fetch the URL, get `arrayBuffer()`, convert to Buffer.
  - Transcribe: `const voiceSvc = voicePipelineService()`, then `const { text } = await voiceSvc.transcribe(oggBuffer, "ogg")`. The transcribe method handles OGG->WAV16k internally via `transcodeToWav16k`.
  - If the transcription is empty, reply "Could not transcribe voice message." and return.
  - Send transcription confirmation: `await ctx.reply("Heard: " + text.slice(0, 200))` (truncated for readability).
  - Then relay as text, reusing the SAME text relay logic from the text handler. Extract the text relay logic into a shared `relayToAgent(ctx, chatId, userText, db)` function that both the `message:text` and voice handlers call.
- Refactor: Extract the core relay logic from the existing `message:text` handler into `relayToAgent(ctx, chatId, userText, db)` so both text and voice handlers can use it. This function:
  - Resolves agent, gets/creates conversation, persists user message, collects LLM stream, prefixes with [AgentName], splits long messages, persists assistant message, sends reply.
- CRITICAL: Check total line count stays under 500 (TGRAM-06). The voice handler + refactor should add ~60-80 lines.

cd server && pnpm exec tsc --noEmit 2>&1 | head -30

<acceptance_criteria>
- grep -q "message:voice" server/src/services/telegram.ts
- grep -q "voicePipelineService" server/src/services/telegram.ts
- grep -q "transcribe" server/src/services/telegram.ts
- grep -q "ctx.getFile" server/src/services/telegram.ts
- grep -q "Transcribing" server/src/services/telegram.ts
- grep -qE "relayToAgent|relayText" server/src/services/telegram.ts
- wc -l < server/src/services/telegram.ts | awk '{exit ($1 > 500)}'
</acceptance_criteria>

Voice notes sent to the bot are downloaded, transcribed via voicePipelineService, and relayed to the agent as text. Transcription confirmation shown to user. Shared relay function prevents code duplication.
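
The Task 1 flow (acknowledge first, process in the background, confirm the transcription, then relay) could be sketched roughly as follows. `VoiceCtx` and `VoiceDeps` are hypothetical stand-ins so the sketch runs without grammy or the database; the real handler would take grammy's `Context`, call `voicePipelineService().transcribe`, and use the plan's `relayToAgent(ctx, chatId, userText, db)` signature. The `download` dependency is likewise an assumed wrapper around the fetch step.

```typescript
// Structural stand-ins for grammy's Context and the voice pipeline.
// These types are assumptions for illustration, not the real definitions.
interface VoiceCtx {
  reply(text: string): Promise<unknown>;
  getFile(): Promise<{ file_path?: string }>;
}

interface VoiceDeps {
  download(filePath: string): Promise<Buffer>; // hypothetical: fetches the OGG bytes
  transcribe(audio: Buffer, format?: string): Promise<{ text: string }>;
  relayToAgent(ctx: VoiceCtx, userText: string): Promise<void>;
}

async function processVoiceMessage(ctx: VoiceCtx, deps: VoiceDeps): Promise<void> {
  const file = await ctx.getFile();
  if (!file.file_path) throw new Error("Telegram returned no file_path");

  const oggBuffer = await deps.download(file.file_path);
  const { text } = await deps.transcribe(oggBuffer, "ogg");

  if (!text.trim()) {
    await ctx.reply("Could not transcribe voice message.");
    return;
  }

  // Confirm what was heard (truncated), then reuse the shared text relay.
  await ctx.reply("Heard: " + text.slice(0, 200));
  await deps.relayToAgent(ctx, text);
}

// Handler body: acknowledge immediately, then start processing without awaiting
// it, so the update handler returns quickly.
async function onVoiceMessage(ctx: VoiceCtx, deps: VoiceDeps): Promise<void> {
  await ctx.reply("Transcribing...");
  void processVoiceMessage(ctx, deps).catch(async (err) => {
    console.warn("voice processing failed:", err);
    await ctx.reply("Voice transcription failed.").catch(() => undefined);
  });
}
```

Injecting the download, transcription, and relay steps keeps the handler testable with fakes; in `telegram.ts` the same shape would be wired as `bot.on("message:voice", (ctx) => onVoiceMessage(ctx, deps))`.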
- Add a `transcodeToOggOpus(rawPcmBuffer: Buffer): Promise<Buffer>` helper function:
  - Use `spawn` from `node:child_process` with `ffmpegPath` from `ffmpeg-static`.
  - Input: raw PCM s16le (Piper output). Use `-f s16le -ar 22050 -ac 1 -i pipe:0`.
  - Output: OGG Opus for Telegram. Use `-c:a libopus -ar 48000 -f ogg pipe:1`.
  - Collect stdout chunks into a Buffer. Reject on non-zero exit code.
  - Note on sample rate: Piper `en_US-lessac-medium` outputs 22050Hz. If the model metadata is available at `voicePipelineService`, read it. Otherwise hardcode 22050 with a comment noting it must match the Piper model.
- Modify the `relayToAgent` function (or add a post-relay hook):
  - After sending the text reply, check if the user's last message was a voice note (pass a `voiceMode` flag or check context).
  - If voice mode: call `voiceSvc.formatForVoice(fullResponse)` to strip markdown, then `voiceSvc.synthesize(voiceText)` to get raw PCM, then `transcodeToOggOpus(pcmBuffer)` to get OGG, then `ctx.replyWithVoice(new InputFile(oggBuffer, "response.ogg"))`.
  - Wrap the TTS reply in try/catch: if synthesis fails (Piper not installed), log a warning and skip the voice reply silently. The text reply has already been sent, so the user still gets the response.
- Alternative approach (simpler): Always send the text reply first. Then, if the original message was voice, attempt a voice reply as a bonus. This way a TTS failure never blocks the text response.
- Keep total telegram.ts under 500 lines. The TTS reply adds ~40-50 lines.

cd server && pnpm exec tsc --noEmit 2>&1 | head -30

<acceptance_criteria>
- grep -qE "transcodeToOggOpus|ogg.*opus|libopus" server/src/services/telegram.ts
- grep -q "synthesize" server/src/services/telegram.ts
- grep -qE "replyWithVoice|reply_with_voice" server/src/services/telegram.ts
- grep -q "InputFile" server/src/services/telegram.ts
- grep -q "formatForVoice" server/src/services/telegram.ts
- wc -l < server/src/services/telegram.ts | awk '{exit ($1 > 500)}'
</acceptance_criteria>

Agent responses to voice messages include both a text reply and an OGG voice note. TTS failure degrades gracefully to text-only. telegram.ts remains under 500 lines.
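
One possible shape for the `transcodeToOggOpus` helper from this task is sketched below. It is an illustration under assumptions: the plan takes the binary from `ffmpeg-static`, whereas this sketch accepts the path as an optional parameter (`ffmpegBin`, hypothetical) so it has no package dependency, and `pipeThroughProcess` is a hypothetical generic helper. The 22050 Hz default must match the Piper model.

```typescript
import { spawn } from "node:child_process";

/** Pipes `input` through a child process and resolves with its stdout. */
function pipeThroughProcess(bin: string, args: string[], input: Buffer): Promise<Buffer> {
  return new Promise<Buffer>((resolve, reject) => {
    const child = spawn(bin, args);
    const stdout: Buffer[] = [];
    const stderr: Buffer[] = [];

    child.stdout.on("data", (chunk: Buffer) => stdout.push(chunk));
    child.stderr.on("data", (chunk: Buffer) => stderr.push(chunk));
    child.on("error", reject); // e.g. binary not found
    child.on("close", (code) => {
      if (code === 0) {
        resolve(Buffer.concat(stdout));
      } else {
        // Include the tail of stderr so ffmpeg failures are diagnosable.
        const tail = Buffer.concat(stderr).toString("utf8").slice(-300);
        reject(new Error(`${bin} exited with code ${code}: ${tail}`));
      }
    });

    // Ignore EPIPE if the process exits before consuming all input;
    // the close handler reports the failure.
    child.stdin.on("error", () => undefined);
    child.stdin.end(input);
  });
}

/** Raw mono s16le PCM in, OGG Opus out. Assumes an ffmpeg build with libopus. */
function transcodeToOggOpus(
  rawPcmBuffer: Buffer,
  ffmpegBin: string = "ffmpeg", // assumption: the plan uses ffmpegPath from ffmpeg-static
  sampleRate: number = 22050,   // must match the Piper model's output rate
): Promise<Buffer> {
  const args = [
    "-f", "s16le", "-ar", String(sampleRate), "-ac", "1", "-i", "pipe:0",
    "-c:a", "libopus", "-ar", "48000", "-f", "ogg", "pipe:1",
  ];
  return pipeThroughProcess(ffmpegBin, args, rawPcmBuffer);
}
```

Separating the process plumbing from the ffmpeg arguments means the stdin/stdout handling can be exercised with any binary, while the ffmpeg-specific part stays a one-line call.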
<success_criteria>
- Voice notes are downloaded from Telegram, transcribed, and relayed to agent as text
- Agent responses generate OGG voice notes sent back via Telegram
- TTS failure degrades gracefully (text reply still works)
- telegram.ts remains under 500 lines total
- TypeScript compiles without errors
</success_criteria>
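
The graceful-degradation criterion (text reply first, voice reply as a best-effort extra) could look roughly like this. `VoiceReplyDeps`, `maybeSendVoiceReply`, `sendVoice`, and `warn` are hypothetical names for this sketch; in `telegram.ts`, `sendVoice` would presumably wrap `ctx.replyWithVoice(new InputFile(oggBuffer, "response.ogg"))`, and the first two dependencies would come from `voicePipelineService()`.

```typescript
// Dependencies are injected so the sketch runs without grammy, Piper, or ffmpeg.
interface VoiceReplyDeps {
  formatForVoice(text: string): string;              // strips markdown for speech
  synthesize(text: string): Promise<Buffer>;         // raw PCM s16le
  transcodeToOggOpus(pcm: Buffer): Promise<Buffer>;  // PCM -> OGG Opus
  sendVoice(ogg: Buffer): Promise<void>;             // hypothetical replyWithVoice wrapper
  warn(message: string): void;
}

/**
 * Call this only after the text reply has been sent.
 * Returns true if a voice note was sent, false if skipped or failed.
 */
async function maybeSendVoiceReply(
  fullResponse: string,
  voiceMode: boolean,
  deps: VoiceReplyDeps,
): Promise<boolean> {
  if (!voiceMode) return false;
  try {
    const voiceText = deps.formatForVoice(fullResponse);
    const pcmBuffer = await deps.synthesize(voiceText);
    const oggBuffer = await deps.transcodeToOggOpus(pcmBuffer);
    await deps.sendVoice(oggBuffer);
    return true;
  } catch (err) {
    // TTS is a bonus: log and continue, since the text reply already went out.
    const reason = err instanceof Error ? err.message : String(err);
    deps.warn(`TTS reply skipped: ${reason}`);
    return false;
  }
}
```

Because the function never throws, a missing Piper install or an ffmpeg failure cannot turn a successful text reply into a handler error.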