mirror of
https://github.com/clawdbot/clawdbot.git
synced 2026-01-31 19:37:45 +01:00
| summary | read_when |
|---|---|
| How inbound audio/voice notes are downloaded, transcribed, and injected into replies | |
Audio / Voice Notes — 2026-01-17
What works
- Media understanding (audio): If `tools.media.audio` is enabled (or a shared `tools.media.models` entry supports audio), Clawdbot:
  - Locates the first audio attachment (local path or URL) and downloads it if needed.
  - Enforces `maxBytes` before sending to each model entry.
  - Runs the first eligible model entry in order (provider or CLI).
  - If it fails or skips (size/timeout), it tries the next entry.
  - On success, it replaces `Body` with an `[Audio]` block and sets `{{Transcript}}`.
- Command parsing: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
- Verbose logging: In `--verbose`, we log when transcription runs and when it replaces the body.
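The fallback order described above can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not Clawdbot's actual code: `ModelEntry`, `Outcome`, and `transcribeWithFallback` are hypothetical names, and `run` is a synchronous stand-in for a provider call or CLI invocation.

```typescript
// Hypothetical types for illustration only; not Clawdbot's internal API.
type Outcome =
  | { ok: true; text: string }
  | { ok: false; reason: string };

interface ModelEntry {
  name: string;
  maxBytes: number;
  // Stand-in for a provider request or a CLI run.
  run: (audioBytes: number) => Outcome;
}

interface ChainResult {
  transcript: string | null;
  log: string[];
}

function transcribeWithFallback(entries: ModelEntry[], audioBytes: number): ChainResult {
  const log: string[] = [];
  for (const entry of entries) {
    // Enforce the size cap before the entry ever sees the audio.
    if (audioBytes > entry.maxBytes) {
      log.push(`${entry.name}: skipped (size)`);
      continue;
    }
    const outcome = entry.run(audioBytes);
    if (outcome.ok) {
      log.push(`${entry.name}: ok`);
      return { transcript: outcome.text, log };
    }
    // Failure (for example a timeout) falls through to the next entry.
    log.push(`${entry.name}: failed (${outcome.reason})`);
  }
  // No entry succeeded; in this sketch the caller would leave the body untouched.
  return { transcript: null, log };
}

const entries: ModelEntry[] = [
  { name: "small-cap", maxBytes: 1000, run: () => ({ ok: true, text: "never reached" }) },
  { name: "flaky", maxBytes: 50000000, run: () => ({ ok: false, reason: "timeout" }) },
  { name: "cli", maxBytes: 50000000, run: () => ({ ok: true, text: "/status please" }) },
];

const result = transcribeWithFallback(entries, 2000000);
console.log(result.transcript);
console.log(result.log.join("\n"));
```

In this sketch the first entry is skipped for size, the second fails, and the third supplies the transcript, which is the text that command parsing would then see.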
Config examples
Provider + CLI fallback (OpenAI + Whisper CLI)
{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxBytes: 20971520,
        models: [
          { provider: "openai", model: "whisper-1" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
            timeoutSeconds: 45
          }
        ]
      }
    }
  }
}
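To make the `{{MediaPath}}` placeholder in `args` concrete, here is a rough TypeScript sketch of how such a template could be expanded before the CLI runs. `expandArgs` is a hypothetical helper, and leaving unknown placeholders untouched is an assumption; Clawdbot's real substitution logic may differ.

```typescript
// Hypothetical helper for illustration; not a Clawdbot export.
function expandArgs(args: string[], vars: Record<string, string>): string[] {
  return args.map((arg) =>
    arg.replace(/\{\{(\w+)\}\}/g, (whole: string, key: string) => {
      const value = vars[key];
      // Assumption: placeholders without a value are left as written.
      return value !== undefined ? value : whole;
    })
  );
}

const argv = expandArgs(
  ["--model", "base", "{{MediaPath}}"],
  { MediaPath: "/tmp/voice note.ogg" }
);
console.log(argv);
```

Because each argument stays a separate array element, a path containing spaces remains a single argument as long as the command is spawned without a shell.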
Provider-only with scope gating
{
  tools: {
    media: {
      audio: {
        enabled: true,
        scope: {
          default: "allow",
          rules: [
            { action: "deny", match: { chatType: "group" } }
          ]
        },
        models: [
          { provider: "openai", model: "whisper-1" }
        ]
      }
    }
  }
}
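The scope block above can be read as a first-match-wins rule list with a default. The TypeScript sketch below illustrates that reading; `resolveScope` and the type names are hypothetical, and treating a rule with no `chatType` as matching everything is an assumption.

```typescript
type ChatType = "direct" | "group" | "room";
type Action = "allow" | "deny";

interface ScopeRule {
  action: Action;
  match: { chatType?: ChatType };
}

interface Scope {
  default: Action;
  rules: ScopeRule[];
}

// Hypothetical evaluator: the first matching rule decides, otherwise the default applies.
function resolveScope(scope: Scope, chatType: ChatType): Action {
  for (const rule of scope.rules) {
    const wanted = rule.match.chatType;
    // Assumption: a rule without a chatType constraint matches any chat.
    if (wanted === undefined || wanted === chatType) {
      return rule.action;
    }
  }
  return scope.default;
}

const scope: Scope = {
  default: "allow",
  rules: [{ action: "deny", match: { chatType: "group" } }],
};

console.log(resolveScope(scope, "group"));
console.log(resolveScope(scope, "direct"));
```

With this config, group chats are denied and every other chat type falls through to the `allow` default.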
Provider-only (Deepgram)
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [{ provider: "deepgram", model: "nova-3" }]
      }
    }
  }
}
Notes & limits
- Provider auth follows the standard model auth order (auth profiles, env vars, `models.providers.*.apiKey`).
- Deepgram picks up `DEEPGRAM_API_KEY` when `provider: "deepgram"` is used.
- Deepgram setup details: Deepgram (audio transcription).
- Audio providers can override `baseUrl`, `headers`, and `providerOptions` via `tools.media.audio`.
- Default size cap is 20MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried.
- Default `maxChars` for audio is unset (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output.
- Use `tools.media.audio.attachments` to process multiple voice notes (`mode: "all"` + `maxAttachments`).
- Transcript is available to templates as `{{Transcript}}`.
- CLI stdout is capped (5MB); keep CLI output concise.
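The size and length limits above can be illustrated with a short TypeScript sketch. The helper names are hypothetical, and two behaviors are assumptions rather than documented facts: that a per-entry `maxChars` takes precedence over the shared one, and that trimming is a plain cut at the character limit.

```typescript
// 20 MiB, the same value as maxBytes: 20971520 in the first config example.
const DEFAULT_MAX_BYTES = 20 * 1024 * 1024;

// Assumption: a per-entry maxChars overrides the shared tools.media.audio.maxChars.
function effectiveMaxChars(entryMaxChars?: number, sharedMaxChars?: number): number | undefined {
  return entryMaxChars ?? sharedMaxChars;
}

// Assumption: trimming is a simple cut; unset means the full transcript is kept.
function trimTranscript(text: string, maxChars?: number): string {
  if (maxChars === undefined) return text;
  return text.length > maxChars ? text.slice(0, maxChars) : text;
}

const full = "remind me to call the dentist tomorrow";
console.log(DEFAULT_MAX_BYTES);
console.log(trimTranscript(full, effectiveMaxChars(9, 100)));
console.log(trimTranscript(full, effectiveMaxChars()));
```

Here the per-entry limit of 9 wins over the shared limit of 100, and leaving both unset returns the transcript unchanged.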
Gotchas
- Scope rules use first-match wins. `chatType` is normalized to `direct`, `group`, or `room`.
- Ensure your CLI exits 0 and prints plain text; JSON needs to be massaged via `jq -r .text`.
- Keep timeouts reasonable (`timeoutSeconds`, default 60s) to avoid blocking the reply queue.