Nexus Dev 262af05116 docs(38-02): complete Telegram voice handling plan — OGG download + Whisper STT + Piper TTS reply

2026-04-04 03:55:50 +00:00

5.4 KiB


phase: 38-telegram-bridge
plan: 02
subsystem: api
tags: grammy, voice, whisper, piper, ogg, ffmpeg, tts, stt

requires:
  phase	provides
  38-01	telegramService factory with text relay, session map, bot lifecycle
  36-voice-pipeline	voicePipelineService (transcribe, synthesize, formatForVoice, transcodeToWav16k)

provides:
  Voice message handler: OGG download via ctx.getFile(), transcription via voicePipelineService
  Shared relayToAgent() function used by both text and voice message handlers
  transcodeToOggOpus() helper: raw PCM s16le (Piper 22050Hz) -> OGG Opus 48000Hz for Telegram
  TTS voice reply: agent responses synthesized to OGG voice note via ctx.replyWithVoice()
  Graceful TTS degradation: text reply always sent first; voice is a bonus that silently fails

affects:
  38-03 (onboarding step unchanged; already uses POST /telegram/token)

tech-stack:
  added: (none listed)
  patterns:
    Immediate 'Transcribing...' reply prevents Telegram update resend (Pitfall 1)
    Fire-and-forget async: processVoiceMessage() not awaited inside handler body
    Shared relayToAgent(ctx, chatId, userText, db, voiceMode) eliminates duplicate relay logic
    TTS reply wrapped in try/catch; voice failure never blocks text response
    transcodeToOggOpus uses same ffmpeg spawn pattern as voice-pipeline.ts

key-files:
  created: (none listed)
  modified: server/src/services/telegram.ts

key-decisions:
  Both tasks implemented together in one atomic file write. Task 1 (voice handler + relay refactor) and Task 2 (TTS reply) both modify telegram.ts, so they are committed as one coherent change
  processVoiceMessage() extracted as top-level async function; keeps bot handler clean and makes error handling explicit
  voiceMode flag passed to relayToAgent() rather than checking ctx type; simpler and avoids grammy type gymnastics
  botToken stored as module-level mutable ref (botToken = token) in start(); processVoiceMessage needs token for CDN URL construction
  Piper hardcoded to 22050Hz in transcodeToOggOpus with comment; matches en_US-lessac-medium model spec

metrics:
  duration: 10min
  completed: 2026-04-04
  tasks_completed: 2
  tasks_total: 2
  files_modified: 1

Phase 38 Plan 02: Telegram Voice Handling Summary

Wires OGG download, Whisper transcription, and a Piper TTS reply into the existing telegramService, with a shared relayToAgent() function and graceful voice degradation.

Performance

Duration: ~10 min
Completed: 2026-04-04
Tasks: 2 of 2
Files modified: 1

Accomplishments

Refactored the text relay into a shared relayToAgent(ctx, chatId, userText, db, voiceMode), which eliminates duplicate logic between the text and voice handlers
Added a bot.on("message:voice", ...) handler that sends an immediate "Transcribing..." reply (prevents Telegram from resending the update) and then processes the message asynchronously without awaiting it
processVoiceMessage() downloads the OGG file from the Telegram CDN (ctx.getFile() followed by fetch), transcribes it with voicePipelineService().transcribe(oggBuffer, "ogg"), sends a "Heard: ..." confirmation, and relays the transcript to the agent
transcodeToOggOpus() uses the ffmpeg-static spawn pattern to convert raw PCM s16le (Piper, 22050 Hz) to OGG Opus at 48000 Hz for Telegram voice notes
TTS voice reply: after the text reply, calls voiceSvc.formatForVoice(), synthesize(), transcodeToOggOpus(), and ctx.replyWithVoice(InputFile(...)) in sequence; the chain is wrapped in try/catch so Piper unavailability degrades silently
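
The handler flow above (immediate acknowledgement, un-awaited processing, shared relay) can be sketched roughly as follows. This is an illustration, not the contents of telegram.ts: ChatCtx, Agent, Transcribe, handleVoice, and the onError callback are hypothetical stand-ins, and the agent parameter takes the place of the real db argument and proxy call.

```typescript
// Hypothetical minimal context; grammY's real Context is far richer.
interface ChatCtx {
  reply(text: string): Promise<void>;
}
type Agent = (chatId: number, userText: string) => Promise<string>;
type Transcribe = (ogg: Uint8Array) => Promise<string>;

// Shared relay for both handlers; voiceMode defaults to false for text.
async function relayToAgent(
  ctx: ChatCtx,
  chatId: number,
  userText: string,
  agent: Agent,
  voiceMode = false,
): Promise<void> {
  const answer = await agent(chatId, userText);
  await ctx.reply(answer);
  if (voiceMode) {
    // The optional TTS voice reply would be attempted here, after the text.
  }
}

// Background work: transcribe, confirm what was heard, then relay.
async function processVoiceMessage(
  ctx: ChatCtx,
  chatId: number,
  ogg: Uint8Array,
  transcribe: Transcribe,
  agent: Agent,
): Promise<void> {
  const text = await transcribe(ogg);
  await ctx.reply(`Heard: ${text}`);
  await relayToAgent(ctx, chatId, text, agent, true);
}

// Handler body: acknowledge first, then start processing without awaiting it.
async function handleVoice(
  ctx: ChatCtx,
  chatId: number,
  ogg: Uint8Array,
  transcribe: Transcribe,
  agent: Agent,
  onError: (err: unknown) => void,
): Promise<void> {
  await ctx.reply("Transcribing...");
  // Fire-and-forget: errors go to onError instead of rejecting the handler.
  void processVoiceMessage(ctx, chatId, ogg, transcribe, agent).catch(onError);
}
```

Because handleVoice resolves as soon as the acknowledgement is sent, a slow transcription does not hold the update open; the cost is that failures must be reported through a side channel such as the onError callback assumed here.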

Task Commits

Task 1 + Task 2: Voice handler + TTS reply - e7205724 (feat) — both tasks in single atomic commit (same file)

Files Created/Modified

server/src/services/telegram.ts (322 lines, was 187) — voice handler, relayToAgent(), transcodeToOggOpus(), TTS reply

Decisions Made

botToken stored as module-level mutable ref alongside bot — processVoiceMessage() needs token string to construct the Telegram CDN download URL
voiceMode = false default parameter on relayToAgent() — text handler calls without flag, voice handler passes true
TTS failure is a warning (not an error) — voice reply is bonus feature, text always delivered first
transcodeToOggOpus hardcodes 22050Hz input rate with explanatory comment — matches Piper en_US-lessac-medium output spec
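
Two of these decisions can be illustrated with small helpers. This is a sketch under stated assumptions: setBotToken, buildFileUrl, and buildOggOpusArgs are hypothetical names, the mono channel count is an assumption about Piper's output, and the argument list is one plausible ffmpeg invocation rather than the one actually used in telegram.ts. The download URL shape is the one documented for the Telegram Bot API getFile method.

```typescript
// Input rate stated in the summary for Piper's en_US-lessac-medium voice.
const PIPER_SAMPLE_RATE_HZ = 22050;
// Output rate stated in the summary for Telegram voice notes.
const TELEGRAM_OPUS_RATE_HZ = 48000;

// ffmpeg arguments for raw PCM on stdin -> OGG Opus on stdout.
function buildOggOpusArgs(inputRateHz: number = PIPER_SAMPLE_RATE_HZ): string[] {
  return [
    "-f", "s16le",               // raw signed 16-bit little-endian PCM input
    "-ar", String(inputRateHz),  // input sample rate (must precede -i)
    "-ac", "1",                  // assumption: mono audio
    "-i", "pipe:0",              // read PCM from stdin
    "-c:a", "libopus",           // encode with Opus
    "-ar", String(TELEGRAM_OPUS_RATE_HZ),
    "-f", "ogg",                 // OGG container
    "pipe:1",                    // write the result to stdout
  ];
}

// Module-level mutable token, set when the bot starts.
let botToken: string | null = null;

function setBotToken(token: string): void {
  botToken = token;
}

// Builds the file download URL from the file_path returned by getFile.
function buildFileUrl(filePath: string): string {
  if (botToken === null) {
    throw new Error("bot not started: token is unset");
  }
  return `https://api.telegram.org/file/bot${botToken}/${filePath}`;
}
```

Keeping the sample rate in a named constant preserves the hardcoded value while making the dependency on the Piper model explicit; a different voice model would need a different input rate.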

Deviations from Plan

Minor adjustments

1. [Rule 1 - Structural] Tasks 1 and 2 implemented and committed together

Found during: Task 1 planning
Issue: Both tasks modify the same file; splitting into two commits would require an intermediate state where voice handler exists without TTS, which is not a meaningful checkpoint
Fix: Single commit covers both tasks; commit message documents both additions
Files modified: server/src/services/telegram.ts
Commit: e7205724

Known Stubs

None — voice relay is fully wired:

OGG download: real Telegram CDN fetch via ctx.getFile()
Transcription: real voicePipelineService().transcribe() (Whisper)
TTS synthesis: real voicePipelineService().synthesize() (Piper)
Voice reply: real ctx.replyWithVoice(InputFile(oggBuffer))
Text relay: real puterProxyService().chatStream() (same as Plan 01)

The only runtime dependency is Whisper/Piper availability — both degrade gracefully with informative error messages.
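
The text-first, voice-as-bonus ordering for TTS can be sketched like this. VoiceDeps and replyTextThenVoice are hypothetical; synthesize stands in for the whole formatForVoice, Piper, and transcode chain, and the warning text is illustrative only.

```typescript
// Hypothetical dependencies, injected so the ordering is easy to see and test.
interface VoiceDeps {
  sendText(text: string): Promise<void>;
  synthesize(text: string): Promise<Uint8Array>; // stand-in for format + Piper + transcode
  sendVoice(ogg: Uint8Array): Promise<void>;
  warn(message: string): void;
}

// Sends the text reply first, then attempts the voice note.
// Resolves to true if the voice note was sent, false if TTS was skipped.
async function replyTextThenVoice(
  answer: string,
  deps: VoiceDeps,
): Promise<boolean> {
  // A failure here propagates: the text reply is the required part.
  await deps.sendText(answer);
  try {
    const ogg = await deps.synthesize(answer);
    await deps.sendVoice(ogg);
    return true;
  } catch (err) {
    // Voice is optional: log a warning and leave the text reply as the result.
    const reason = err instanceof Error ? err.message : String(err);
    deps.warn(`TTS reply skipped: ${reason}`);
    return false;
  }
}
```

Only the voice steps sit inside the try block, so a missing Piper binary costs the user the voice note but never the answer itself.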

Self-Check: PASSED

File exists: server/src/services/telegram.ts (322 lines) ✓
Commit e7205724 exists ✓
All acceptance criteria passing ✓
