feat: add inbound media understanding

Co-authored-by: Tristan Manchester <tmanchester96@gmail.com>
Peter Steinberger
2026-01-17 03:52:37 +00:00
parent 4b749f1b8f
commit 1b973f7506
42 changed files with 2547 additions and 101 deletions

View File

@@ -3,25 +3,59 @@ summary: "How inbound audio/voice notes are downloaded, transcribed, and injecte
read_when:
- Changing audio transcription or media handling
---
# Audio / Voice Notes — 2025-12-05
# Audio / Voice Notes — 2026-01-17
## What works
- **Optional transcription**: If `tools.audio.transcription` is set in `~/.clawdbot/clawdbot.json`, Clawdbot will:
1) Download inbound audio to a temp path when WhatsApp only provides a URL.
2) Run the configured CLI args (templated with `{{MediaPath}}`), expecting transcript on stdout.
3) Replace `Body` with the transcript, set `{{Transcript}}`, and prepend the original media path plus a `Transcript:` section in the command prompt so models see both.
4) Continue through the normal auto-reply pipeline (templating, sessions, Pi command).
- **Verbose logging**: In `--verbose`, we log when transcription runs and when the transcript replaces the body.
- **Media understanding (audio)**: If `tools.media.audio` is enabled and has `models`, Clawdbot:
1) Locates the first audio attachment (local path or URL) and downloads it if needed.
2) Enforces `maxBytes` before sending to each model entry.
3) Runs the first eligible model entry in order (provider or CLI).
4) If it fails or skips (size/timeout), it tries the next entry.
5) On success, it replaces `Body` with an `[Audio]` block and sets `{{Transcript}}`.
- **Command parsing**: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
- **Verbose logging**: In `--verbose`, we log when transcription runs and when it replaces the body.
## Config example (Whisper CLI)
Requires `whisper` CLI installed:
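A minimal sketch of the legacy `tools.audio.transcription` shape (the `command` key name is an assumption; `args` and `timeoutSeconds` follow the values used elsewhere in this doc, and the transcript is expected on stdout):
```json5
{
  tools: {
    audio: {
      transcription: {
        command: "whisper",                         // assumed key; needs the whisper CLI on PATH
        args: ["--model", "base", "{{MediaPath}}"], // transcript expected on stdout
        timeoutSeconds: 45
      }
    }
  }
}
```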
## Config examples
### Provider + CLI fallback (OpenAI + Whisper CLI)
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxBytes: 20971520,
        models: [
          { provider: "openai", model: "whisper-1" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
            timeoutSeconds: 45
          }
        ]
      }
    }
  }
}
```
### Provider-only with scope gating
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        scope: {
          default: "allow",
          rules: [
            { action: "deny", match: { chatType: "group" } }
          ]
        },
        models: [
          { provider: "openai", model: "whisper-1" }
        ]
      }
    }
  }
}
```
@@ -29,12 +63,13 @@ Requires `whisper` CLI installed:
## Notes & limits
- We don't ship a transcriber; you opt in with the Whisper CLI on your PATH.
- Size guard: inbound audio must be ≤5MB (matches the temp media store and transcript pipeline).
- Outbound caps: web send supports audio/voice up to 16MB (sent as a voice note with `ptt: true`).
- If transcription fails, we fall back to the original body/media note; replies still go through.
- Transcript is available to templates as `{{Transcript}}`; models get both the media path and a `Transcript:` block in the prompt when using command mode.
- Provider auth follows the standard model auth order (auth profiles, env vars, `models.providers.*.apiKey`).
- Default size cap is 20MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried.
- Default `maxChars` for audio is **unset** (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output.
- Transcript is available to templates as `{{Transcript}}`.
- CLI stdout is capped (5MB); keep CLI output concise.
## Gotchas
- Scope rules are evaluated first-match-wins. `chatType` is normalized to `direct`, `group`, or `room`.
- Ensure your CLI exits 0 and prints plain text; JSON output needs massaging (e.g. via `jq -r .text`), see the sketch after this list.
- Keep timeouts reasonable (`timeoutSeconds`, default 45s) to avoid blocking the reply queue.
- Keep timeouts reasonable (`timeoutSeconds`, default 60s) to avoid blocking the reply queue.
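A hypothetical CLI entry that wraps a JSON-emitting transcriber with `jq` so only plain text reaches stdout (`my-transcriber` and its flags are placeholders, not a shipped tool):
```json5
{
  type: "cli",
  command: "sh",
  args: [
    "-c",
    // placeholder transcriber; jq strips the JSON wrapper so only the text field is printed
    "my-transcriber --json {{MediaPath}} | jq -r .text"
  ],
  timeoutSeconds: 60
}
```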

View File

@@ -38,13 +38,23 @@ The WhatsApp channel runs via **Baileys Web**. This document captures the curren
- `{{MediaUrl}}` pseudo-URL for the inbound media.
- `{{MediaPath}}` local temp path written before running the command.
- When a per-session Docker sandbox is enabled, inbound media is copied into the sandbox workspace and `MediaPath`/`MediaUrl` are rewritten to a relative path like `media/inbound/<filename>`.
- Audio transcription (if configured via `tools.audio.transcription`) runs before templating and can replace `Body` with the transcript.
- Media understanding (if configured via `tools.media.*`) runs before templating and can insert `[Image]`, `[Audio]`, and `[Video]` blocks into `Body`.
- Audio sets `{{Transcript}}` and uses the transcript for command parsing so slash commands still work.
- Video and image descriptions preserve any caption text for command parsing.
- Only the first matching image/audio/video attachment is processed; remaining attachments are left untouched.
## Limits & Errors
**Outbound send caps (WhatsApp web send)**
- Images: ~6MB cap after recompression.
- Audio/voice/video: 16MB cap; documents: 100MB cap.
- Oversize or unreadable media → clear error in logs and the reply is skipped.
**Media understanding caps (transcription/description)**
- Image default: 10MB (`tools.media.image.maxBytes`).
- Audio default: 20MB (`tools.media.audio.maxBytes`).
- Video default: 50MB (`tools.media.video.maxBytes`).
- Oversize media skips understanding, but replies still go through with the original body.
## Notes for Tests
- Cover send + reply flows for image/audio/document cases.
- Validate recompression for images (size bound) and voice-note flag for audio.

View File

@@ -0,0 +1,217 @@
---
summary: "Inbound image/audio/video understanding (optional) with provider + CLI fallbacks"
read_when:
- Designing or refactoring media understanding
- Tuning inbound audio/video/image preprocessing
---
# Media Understanding (Inbound) — 2026-01-17
Clawdbot can optionally **summarize inbound media** (image/audio/video) before the reply pipeline runs. This is **opt-in** and separate from the base attachment flow—if understanding is off, models still receive the original files/URLs as usual.
## Goals
- Optional: pre-digest inbound media into short text for faster routing + better command parsing.
- Preserve original media delivery to the model (always).
- Support **provider APIs** and **CLI fallbacks**.
- Allow multiple models with ordered fallback (error/size/timeout).
## High-level behavior
1) Collect inbound attachments (`MediaPaths`, `MediaUrls`, `MediaTypes`).
2) For each enabled capability (image/audio/video), pick the **first matching attachment**.
3) Choose the first eligible model entry (size + capability + auth).
4) If a model fails or the media is too large, **fall back to the next entry**.
5) On success:
- `Body` becomes `[Image]`, `[Audio]`, or `[Video]` block.
- Audio sets `{{Transcript}}` and `CommandBody`/`RawBody` for command parsing.
- Captions are preserved as `User text:` inside the block.
If understanding fails or is disabled, **the reply flow continues** with the original body + attachments.
## Config overview
Use **per-capability configs** under `tools.media`. Each capability can define:
- defaults (`prompt`, `maxChars`, `maxBytes`, `timeoutSeconds`, `language`)
- **ordered `models` list** (fallback order)
- `scope` (optional gating by channel/chatType/session key)
```json5
{
tools: {
media: {
image: { /* config */ },
audio: { /* config */ },
video: { /* config */ }
}
}
}
```
### Model entries
Each `models[]` entry can be **provider** or **CLI**:
```json5
{
type: "provider", // default if omitted
provider: "openai",
model: "gpt-5.2",
prompt: "Describe the image in <= 500 chars.",
maxChars: 500,
maxBytes: 10485760,
timeoutSeconds: 60,
capabilities: ["image"], // optional, used for multimodal entries
profile: "vision-profile",
preferredProfile: "vision-fallback"
}
```
```json5
{
type: "cli",
command: "gemini",
args: [
"-m",
"gemini-3-flash",
"--allowed-tools",
"read_file",
"Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
],
maxChars: 500,
maxBytes: 52428800,
timeoutSeconds: 120,
capabilities: ["video", "image"]
}
```
## Defaults and limits
Recommended defaults:
- `maxChars`: **500** for image/video (short, command-friendly)
- `maxChars`: **unset** for audio (full transcript unless you set a limit)
- `maxBytes`:
- image: **10MB**
- audio: **20MB**
- video: **50MB**
Rules:
- If media exceeds `maxBytes`, that model is skipped and the **next model is tried**.
- If the model returns more than `maxChars`, output is trimmed.
- `prompt` defaults to a simple “Describe the {media}.” plus the `maxChars` guidance (image/video only).
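Spelled out as config, the recommended defaults above would look roughly like this (a sketch; only keys documented here are used, and audio leaves `maxChars` unset for a full transcript):
```json5
{
  tools: {
    media: {
      image: { maxBytes: 10485760, maxChars: 500 },
      audio: { maxBytes: 20971520 },                // maxChars unset: full transcript
      video: { maxBytes: 52428800, maxChars: 500 }
    }
  }
}
```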
## Capabilities (optional)
If you set `capabilities`, the entry only runs for those media types. Suggested
defaults when you opt in:
- `openai`, `anthropic`: **image**
- `google` (Gemini API): **image + audio + video**
- CLI entries: declare the exact capabilities you support.
If you omit `capabilities`, the entry is eligible for whichever capability list it appears in.
## Provider support matrix (Clawdbot integrations)
| Capability | Provider integration | Notes |
|------------|----------------------|-------|
| Image | OpenAI / Anthropic / Google / others via `pi-ai` | Any image-capable model in the registry works. |
| Audio | OpenAI, Groq | Provider transcription (Whisper). |
| Video | Google (Gemini API) | Provider video understanding. |
## Recommended providers
**Image**
- Prefer your active model if it supports images.
- Good defaults: `openai/gpt-5.2`, `anthropic/claude-opus-4-5`, `google/gemini-3-pro-preview`.
**Audio**
- `openai/whisper-1` or `groq/whisper-large-v3-turbo`.
- CLI fallback: `whisper` binary.
**Video**
- `google/gemini-3-flash-preview` (fast), `google/gemini-3-pro-preview` (richer).
- CLI fallback: `gemini` CLI (supports `read_file` on video/audio).
## Config examples
### 1) Audio + Video only (image off)
```json5
{
tools: {
media: {
audio: {
enabled: true,
models: [
{ provider: "openai", model: "whisper-1" },
{
type: "cli",
command: "whisper",
args: ["--model", "base", "{{MediaPath}}"]
}
]
},
video: {
enabled: true,
maxChars: 500,
models: [
{ provider: "google", model: "gemini-3-flash-preview" },
{
type: "cli",
command: "gemini",
args: [
"-m",
"gemini-3-flash",
"--allowed-tools",
"read_file",
"Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
]
}
]
}
}
}
}
```
### 2) Optional image understanding
```json5
{
tools: {
media: {
image: {
enabled: true,
maxBytes: 10485760,
maxChars: 500,
models: [
{ provider: "openai", model: "gpt-5.2" },
{ provider: "anthropic", model: "claude-opus-4-5" },
{
type: "cli",
command: "gemini",
args: [
"-m",
"gemini-3-flash",
"--allowed-tools",
"read_file",
"Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
]
}
]
}
}
}
}
```
### 3) Multimodal single entry (explicit capabilities)
```json5
{
tools: {
media: {
image: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] },
audio: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] },
video: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] }
}
}
}
```
## Notes
- Understanding is **best-effort**. Errors do not block replies.
- Attachments are still passed to models even when understanding is disabled.
- Use `scope` to limit where understanding runs (e.g. only DMs).
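For example, a DM-only gate might look like the sketch below; the rule shape mirrors the scope example in the audio doc, and it assumes `default: "deny"` is accepted alongside `"allow"`:
```json5
{
  tools: {
    media: {
      image: {
        enabled: true,
        scope: {
          default: "deny",                                     // assumed: deny unless a rule allows
          rules: [
            { action: "allow", match: { chatType: "direct" } } // only direct chats
          ]
        },
        models: [
          { provider: "openai", model: "gpt-5.2" }
        ]
      }
    }
  }
}
```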
## Related docs
- [Configuration](/gateway/configuration)
- [Image & Media Support](/nodes/images)