---
phase: 38-telegram-bridge
plan: "02"
subsystem: api
tags: [telegram, grammy, voice, whisper, piper, ogg, ffmpeg, tts, stt]

requires:
  - phase: 38-01
    provides: telegramService factory with text relay, session map, bot lifecycle
  - phase: 36-voice-pipeline
    provides: voicePipelineService (transcribe, synthesize, formatForVoice, transcodeToWav16k)

provides:
  - Voice message handler: OGG download via ctx.getFile(), transcription via voicePipelineService
  - Shared relayToAgent() function used by both text and voice message handlers
  - transcodeToOggOpus() helper: raw PCM s16le (Piper 22050Hz) -> OGG Opus 48000Hz for Telegram
  - TTS voice reply: agent responses synthesized to OGG voice note via ctx.replyWithVoice()
  - Graceful TTS degradation: text reply always sent first; voice is a bonus that silently fails

affects:
  - 38-03 (onboarding step unchanged; already uses POST /telegram/token)

tech-stack:
  added: []
  patterns:
    - "Immediate 'Transcribing...' reply prevents Telegram update resend (Pitfall 1)"
    - "Fire-and-forget async: processVoiceMessage() not awaited inside handler body"
    - "Shared relayToAgent(ctx, chatId, userText, db, voiceMode) eliminates duplicate relay logic"
    - "TTS reply wrapped in try/catch; voice failure never blocks text response"
    - "transcodeToOggOpus uses same ffmpeg spawn pattern as voice-pipeline.ts"

key-files:
  created: []
  modified:
    - server/src/services/telegram.ts

key-decisions:
  - "Both tasks implemented together in one atomic file write: Task 1 (voice handler + relay refactor) and Task 2 (TTS reply) both modify telegram.ts; committing as one coherent change"
  - "processVoiceMessage() extracted as top-level async function; keeps bot handler clean and makes error handling explicit"
  - "voiceMode flag passed to relayToAgent() rather than checking ctx type; simpler and avoids grammy type gymnastics"
  - "botToken stored as module-level mutable ref (botToken = token) in start(); processVoiceMessage needs token for CDN URL construction"
  - "Piper hardcoded to 22050Hz in transcodeToOggOpus with comment; matches en_US-lessac-medium model spec"

metrics:
  duration: 10min
  completed: 2026-04-04
  tasks_completed: 2
  tasks_total: 2
  files_modified: 1
---

# Phase 38 Plan 02: Telegram Voice Handling Summary

**OGG download + Whisper transcription + Piper TTS reply wired into existing telegramService, with shared relayToAgent() function and graceful voice degradation**

## Performance

- **Duration:** ~10 min
- **Completed:** 2026-04-04
- **Tasks:** 2 of 2
- **Files modified:** 1

## Accomplishments

- Refactored text relay into shared `relayToAgent(ctx, chatId, userText, db, voiceMode)`, eliminating duplicate logic between the text and voice handlers
- Added `bot.on("message:voice", ...)` handler that sends an immediate "Transcribing..." reply (prevents Telegram resend) and processes the message asynchronously
- `processVoiceMessage()`: downloads OGG from Telegram CDN via `ctx.getFile()` + fetch, transcribes via `voicePipelineService().transcribe(oggBuffer, "ogg")`, sends "Heard: ..." confirmation, relays to agent
- `transcodeToOggOpus()`: uses the ffmpeg-static spawn pattern to convert raw PCM s16le (Piper 22050Hz) to OGG Opus 48000Hz for Telegram voice notes
- TTS voice reply: after the text reply, calls `voiceSvc.formatForVoice()` + `synthesize()` + `transcodeToOggOpus()` + `ctx.replyWithVoice(new InputFile(...))`, wrapped in try/catch so Piper unavailability degrades silently
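
The voice handler flow in the list above (acknowledge immediately, then run the slow work without awaiting it) might look roughly like the sketch below. `VoiceCtx`, `VoiceDeps`, `onVoice`, and the fallback error text are hypothetical stand-ins so the example runs without grammY; they are not the actual contents of `telegram.ts`. The download URL format is the one documented by the Telegram Bot API.

```typescript
// Hypothetical minimal shapes standing in for grammY's Context and the voice pipeline.
interface VoiceCtx {
  reply(text: string): Promise<unknown>;
  getFile(): Promise<{ file_path?: string }>;
}
interface VoiceDeps {
  botToken: string;
  download(url: string): Promise<Uint8Array>; // e.g. fetch + arrayBuffer
  transcribe(audio: Uint8Array, format: "ogg"): Promise<string>;
  relay(ctx: VoiceCtx, userText: string): Promise<void>; // stands in for relayToAgent(..., true)
}

// Bot API download URL format: https://api.telegram.org/file/bot<token>/<file_path>
function buildFileUrl(token: string, filePath: string): string {
  return `https://api.telegram.org/file/bot${token}/${filePath}`;
}

// Download -> transcribe -> confirm -> relay. Errors are caught here because
// the caller deliberately does not await this promise.
async function processVoiceMessage(ctx: VoiceCtx, deps: VoiceDeps): Promise<void> {
  try {
    const file = await ctx.getFile();
    if (!file.file_path) throw new Error("Telegram returned no file_path");
    const ogg = await deps.download(buildFileUrl(deps.botToken, file.file_path));
    const text = await deps.transcribe(ogg, "ogg");
    await ctx.reply(`Heard: ${text}`);
    await deps.relay(ctx, text);
  } catch (err) {
    console.error("Voice processing failed:", err);
    // Hypothetical user-facing message; the real wording is not given in this summary.
    await ctx.reply("Sorry, I could not process that voice message.").catch(() => {});
  }
}

// Handler body: reply first so the update is acknowledged quickly,
// then start the slow work as fire-and-forget.
async function onVoice(ctx: VoiceCtx, deps: VoiceDeps): Promise<void> {
  await ctx.reply("Transcribing...");
  void processVoiceMessage(ctx, deps); // intentionally not awaited
}
```

Passing the collaborators in as `deps` is only a convenience for this sketch; per the decisions recorded in this summary, the real module reads `botToken` from a module-level variable instead.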

## Task Commits

1. **Task 1 + Task 2: Voice handler + TTS reply** - `e7205724` (feat): both tasks in a single atomic commit (same file)

## Files Created/Modified

- `server/src/services/telegram.ts` (322 lines, was 187): voice handler, relayToAgent(), transcodeToOggOpus(), TTS reply

## Decisions Made

- `botToken` stored as module-level mutable ref alongside `bot`, because `processVoiceMessage()` needs the token string to construct the Telegram CDN download URL
- `voiceMode = false` default parameter on `relayToAgent()`: the text handler calls without the flag, the voice handler passes `true`
- TTS failure is a warning (not an error): voice reply is a bonus feature, text is always delivered first
- `transcodeToOggOpus` hardcodes the 22050Hz input rate with an explanatory comment, matching the Piper `en_US-lessac-medium` output spec
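
To make the last decision concrete, here is a minimal sketch of what a PCM-to-OGG-Opus transcode over ffmpeg pipes could look like. It is not the code from `telegram.ts`: the helper `buildOggOpusArgs`, the `ffmpegPath` parameter (which would come from `ffmpeg-static`), and the mono channel count are assumptions made for illustration. The ffmpeg flags themselves are standard.

```typescript
import { spawn } from "node:child_process";

// Per this summary: Piper's en_US-lessac-medium model emits 22050 Hz s16le PCM.
const PIPER_SAMPLE_RATE_HZ = 22050;
// Target rate for the OGG Opus voice note, as stated in this summary.
const OPUS_SAMPLE_RATE_HZ = 48000;

/** Build ffmpeg arguments: raw PCM on stdin -> OGG Opus on stdout. */
function buildOggOpusArgs(inputRateHz: number = PIPER_SAMPLE_RATE_HZ): string[] {
  return [
    "-hide_banner", "-loglevel", "error",
    // Input options must precede -i: headerless PCM carries no format metadata.
    "-f", "s16le",
    "-ar", String(inputRateHz),
    "-ac", "1", // assumption: mono output from Piper
    "-i", "pipe:0",
    // Output options: Opus codec, resampled, in an OGG container.
    "-c:a", "libopus",
    "-ar", String(OPUS_SAMPLE_RATE_HZ),
    "-f", "ogg",
    "pipe:1",
  ];
}

/** Hypothetical reimplementation; resolves with the encoded OGG bytes. */
function transcodeToOggOpus(pcm: Buffer, ffmpegPath: string): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const proc = spawn(ffmpegPath, buildOggOpusArgs());
    const out: Buffer[] = [];
    const err: Buffer[] = [];
    proc.stdout.on("data", (chunk: Buffer) => out.push(chunk));
    proc.stderr.on("data", (chunk: Buffer) => err.push(chunk));
    proc.on("error", reject); // e.g. ffmpeg binary not found
    proc.on("close", (code) => {
      if (code === 0) resolve(Buffer.concat(out));
      else reject(new Error(`ffmpeg exited with code ${code}: ${Buffer.concat(err).toString()}`));
    });
    proc.stdin.on("error", () => {}); // ignore EPIPE if ffmpeg exits early
    proc.stdin.end(pcm);
  });
}
```

Keeping the input rate as a named constant makes the coupling to the Piper model visible: switching to a model with a different output rate would require changing it, otherwise the audio would play back at the wrong speed.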

## Deviations from Plan

### Minor adjustments

**1. [Rule 1 - Structural] Tasks 1 and 2 implemented and committed together**
- **Found during:** Task 1 planning
- **Issue:** Both tasks modify the same file; splitting into two commits would require an intermediate state where the voice handler exists without TTS, which is not a meaningful checkpoint
- **Fix:** Single commit covers both tasks; commit message documents both additions
- **Files modified:** server/src/services/telegram.ts
- **Commit:** e7205724

## Known Stubs

None; voice relay is fully wired:
- OGG download: real Telegram CDN fetch via `ctx.getFile()`
- Transcription: real `voicePipelineService().transcribe()` (Whisper)
- TTS synthesis: real `voicePipelineService().synthesize()` (Piper)
- Voice reply: real `ctx.replyWithVoice(new InputFile(oggBuffer))`
- Text relay: real `puterProxyService().chatStream()` (same as Plan 01)

The only runtime dependency is Whisper/Piper availability; both degrade gracefully with informative error messages.
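
As an illustration of the degradation behavior on the TTS side, the sketch below sends the text reply first and treats the voice note as optional. `ReplyCtx`, `VoiceService`, and `deliverReply` are hypothetical names with simplified signatures (for instance, the real `ctx.replyWithVoice()` takes a grammY `InputFile`, not raw bytes); only the ordering and the try/catch boundary reflect what this summary describes.

```typescript
// Hypothetical minimal shapes; the real code uses grammY's Context and voicePipelineService.
interface ReplyCtx {
  reply(text: string): Promise<unknown>;
  replyWithVoice(ogg: Uint8Array): Promise<unknown>;
}
interface VoiceService {
  formatForVoice(text: string): string;
  synthesize(text: string): Promise<Uint8Array>; // raw PCM from Piper
}

async function deliverReply(
  ctx: ReplyCtx,
  agentText: string,
  voiceSvc: VoiceService,
  toOggOpus: (pcm: Uint8Array) => Promise<Uint8Array>,
  voiceMode = false, // text handler omits the flag, voice handler passes true
): Promise<void> {
  // Text is sent and awaited first, so it is delivered even if TTS fails afterwards.
  await ctx.reply(agentText);
  if (!voiceMode) return;

  try {
    const pcm = await voiceSvc.synthesize(voiceSvc.formatForVoice(agentText));
    await ctx.replyWithVoice(await toOggOpus(pcm));
  } catch (err) {
    // A warning, not an error: the user already has the text reply.
    console.warn("TTS voice reply skipped:", err);
  }
}
```

Because the try/catch wraps only the voice path, a missing Piper binary or a failed transcode never rejects the promise returned to the caller.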

## Self-Check: PASSED

- File exists: server/src/services/telegram.ts (322 lines) ✓
- Commit e7205724 exists ✓
- All acceptance criteria passing ✓