Add ez-assistant and kerberos service folders
109
docker-compose/ez-assistant/docs/nodes/audio.md
Normal file
@@ -0,0 +1,109 @@

---
summary: "How inbound audio/voice notes are downloaded, transcribed, and injected into replies"
read_when:
  - Changing audio transcription or media handling
---

# Audio / Voice Notes — 2026-01-17

## What works

- **Media understanding (audio)**: If audio understanding is enabled (or auto-detected), Moltbot:
  1) Locates the first audio attachment (local path or URL) and downloads it if needed.
  2) Enforces `maxBytes` before sending to each model entry.
  3) Runs the first eligible model entry in order (provider or CLI).
  4) If it fails or skips (size/timeout), it tries the next entry.
  5) On success, it replaces `Body` with an `[Audio]` block and sets `{{Transcript}}`.
- **Command parsing**: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
- **Verbose logging**: In `--verbose`, we log when transcription runs and when it replaces the body.

## Auto-detection (default)

If you **don’t configure models** and `tools.media.audio.enabled` is **not** set to `false`, Moltbot auto-detects in this order and stops at the first working option:

1) **Local CLIs** (if installed)
   - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
   - `whisper-cli` (from `whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
   - `whisper` (Python CLI; downloads models automatically)
2) **Gemini CLI** (`gemini`) using `read_many_files`
3) **Provider keys** (OpenAI → Groq → Deepgram → Google)

To disable auto-detection, set `tools.media.audio.enabled: false`. To customize, set `tools.media.audio.models`.

Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
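
For instance, a pinned CLI entry with an absolute command path avoids `PATH` lookup entirely. A sketch reusing the `models` entry fields from the examples below — the paths are placeholders and the flags are illustrative, so check your CLI's own help output:

```json5
{
  tools: {
    media: {
      audio: {
        models: [
          {
            type: "cli",
            // absolute path: no PATH lookup needed
            command: "/opt/homebrew/bin/whisper-cli",
            // placeholder model path + illustrative flags; verify against your build
            args: ["-m", "/opt/models/ggml-base.bin", "-f", "{{MediaPath}}"],
            timeoutSeconds: 60
          }
        ]
      }
    }
  }
}
```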

## Config examples

### Provider + CLI fallback (OpenAI + Whisper CLI)

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxBytes: 20971520,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
            timeoutSeconds: 45
          }
        ]
      }
    }
  }
}
```

### Provider-only with scope gating

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        scope: {
          default: "allow",
          rules: [
            { action: "deny", match: { chatType: "group" } }
          ]
        },
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" }
        ]
      }
    }
  }
}
```

### Provider-only (Deepgram)

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [{ provider: "deepgram", model: "nova-3" }]
      }
    }
  }
}
```

## Notes & limits

- Provider auth follows the standard model auth order (auth profiles, env vars, `models.providers.*.apiKey`).
- Deepgram picks up `DEEPGRAM_API_KEY` when `provider: "deepgram"` is used.
- Deepgram setup details: [Deepgram (audio transcription)](/providers/deepgram).
- Audio providers can override `baseUrl`, `headers`, and `providerOptions` via `tools.media.audio`.
- Default size cap is 20 MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried.
- Default `maxChars` for audio is **unset** (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output.
- The OpenAI auto default is `gpt-4o-mini-transcribe`; set `model: "gpt-4o-transcribe"` for higher accuracy.
- Use `tools.media.audio.attachments` to process multiple voice notes (`mode: "all"` + `maxAttachments`).
- The transcript is available to templates as `{{Transcript}}`.
- CLI stdout is capped (5 MB); keep CLI output concise.

## Gotchas

- Scope rules use first-match wins. `chatType` is normalized to `direct`, `group`, or `room`.
- Ensure your CLI exits 0 and prints plain text; JSON output must be reduced to plain text first (e.g. `jq -r .text`).
- Keep timeouts reasonable (`timeoutSeconds`, default 60s) to avoid blocking the reply queue.
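
For the JSON gotcha above, a small wrapper keeps the CLI contract (exit 0, plain text on stdout). A sketch in Python for portability — `my-transcriber` and the `.text` key are hypothetical placeholders; match them to whatever your transcriber actually emits:

```python
#!/usr/bin/env python3
"""Hypothetical wrapper: run a JSON-emitting transcriber and print plain text."""
import json
import subprocess
import sys


def transcript_from_json(raw: str, key: str = "text") -> str:
    """Extract and trim the transcript field from a JSON payload."""
    return json.loads(raw)[key].strip()


if __name__ == "__main__":
    # Replace ["my-transcriber", "--json", ...] with your real command.
    out = subprocess.run(
        ["my-transcriber", "--json", sys.argv[1]],
        capture_output=True, text=True, check=True,
    ).stdout
    print(transcript_from_json(out))
```

Point a `type: "cli"` model entry at this script so Moltbot only ever sees the plain transcript.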
152
docker-compose/ez-assistant/docs/nodes/camera.md
Normal file
@@ -0,0 +1,152 @@

---
summary: "Camera capture (iOS node + macOS app) for agent use: photos (jpg) and short video clips (mp4)"
read_when:
  - Adding or modifying camera capture on iOS nodes or macOS
  - Extending agent-accessible MEDIA temp-file workflows
---

# Camera capture (agent)

Moltbot supports **camera capture** for agent workflows:

- **iOS node** (paired via Gateway): capture a **photo** (`jpg`) or **short video clip** (`mp4`, with optional audio) via `node.invoke`.
- **Android node** (paired via Gateway): same photo/clip surface via `node.invoke`.
- **macOS app** (node via Gateway): same photo/clip surface via `node.invoke`.

All camera access is gated behind **user-controlled settings**.

## iOS node

### User setting (default on)

- iOS Settings tab → **Camera** → **Allow Camera** (`camera.enabled`)
- Default: **on** (a missing key is treated as enabled).
- When off: `camera.*` commands return `CAMERA_DISABLED`.

### Commands (via Gateway `node.invoke`)

- `camera.list`
  - Response payload:
    - `devices`: array of `{ id, name, position, deviceType }`

- `camera.snap`
  - Params:
    - `facing`: `front|back` (default: `front`)
    - `maxWidth`: number (optional; default `1600` on the iOS node)
    - `quality`: `0..1` (optional; default `0.9`)
    - `format`: currently `jpg`
    - `delayMs`: number (optional; default `0`)
    - `deviceId`: string (optional; from `camera.list`)
  - Response payload:
    - `format: "jpg"`
    - `base64: "<...>"`
    - `width`, `height`
  - Payload guard: photos are recompressed to keep the base64 payload under 5 MB.

- `camera.clip`
  - Params:
    - `facing`: `front|back` (default: `front`)
    - `durationMs`: number (default `3000`, clamped to a max of `60000`)
    - `includeAudio`: boolean (default `true`)
    - `format`: currently `mp4`
    - `deviceId`: string (optional; from `camera.list`)
  - Response payload:
    - `format: "mp4"`
    - `base64: "<...>"`
    - `durationMs`
    - `hasAudio`
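
The `base64` payloads above can be written to disk by any caller that invokes these commands directly. A sketch (field names follow the response payloads listed above; transport details omitted):

```python
import base64
import os
import tempfile


def save_snap_payload(payload: dict) -> str:
    """Write a camera.snap/camera.clip response payload ({format, base64, ...})
    to a temp file and return its path."""
    suffix = "." + payload.get("format", "jpg")
    data = base64.b64decode(payload["base64"])
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path
```

The CLI helper below does essentially this for you and prints `MEDIA:<path>`.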

### Foreground requirement

Like `canvas.*`, the iOS node only allows `camera.*` commands in the **foreground**. Background invocations return `NODE_BACKGROUND_UNAVAILABLE`.

### CLI helper (temp files + MEDIA)

The easiest way to get attachments is via the CLI helper, which writes decoded media to a temp file and prints `MEDIA:<path>`.

Examples:

```bash
moltbot nodes camera snap --node <id>                 # default: both front + back (2 MEDIA lines)
moltbot nodes camera snap --node <id> --facing front
moltbot nodes camera clip --node <id> --duration 3000
moltbot nodes camera clip --node <id> --no-audio
```

Notes:
- `nodes camera snap` defaults to **both** facings to give the agent both views.
- Output files are temporary (in the OS temp directory) unless you build your own wrapper.

## Android node

### User setting (default on)

- Android Settings sheet → **Camera** → **Allow Camera** (`camera.enabled`)
- Default: **on** (a missing key is treated as enabled).
- When off: `camera.*` commands return `CAMERA_DISABLED`.

### Permissions

- Android requires runtime permissions:
  - `CAMERA` for both `camera.snap` and `camera.clip`.
  - `RECORD_AUDIO` for `camera.clip` when `includeAudio=true`.

If permissions are missing, the app will prompt when possible; if denied, `camera.*` requests fail with a `*_PERMISSION_REQUIRED` error.

### Foreground requirement

Like `canvas.*`, the Android node only allows `camera.*` commands in the **foreground**. Background invocations return `NODE_BACKGROUND_UNAVAILABLE`.

### Payload guard

Photos are recompressed to keep the base64 payload under 5 MB.

## macOS app

### User setting (default off)

The macOS companion app exposes a checkbox:

- **Settings → General → Allow Camera** (`moltbot.cameraEnabled`)
- Default: **off**
- When off: camera requests return “Camera disabled by user”.

### CLI helper (node invoke)

Use the main `moltbot` CLI to invoke camera commands on the macOS node.

Examples:

```bash
moltbot nodes camera list --node <id>                     # list camera ids
moltbot nodes camera snap --node <id>                     # prints MEDIA:<path>
moltbot nodes camera snap --node <id> --max-width 1280
moltbot nodes camera snap --node <id> --delay-ms 2000
moltbot nodes camera snap --node <id> --device-id <id>
moltbot nodes camera clip --node <id> --duration 10s      # prints MEDIA:<path>
moltbot nodes camera clip --node <id> --duration-ms 3000  # prints MEDIA:<path> (legacy flag)
moltbot nodes camera clip --node <id> --device-id <id>
moltbot nodes camera clip --node <id> --no-audio
```

Notes:
- `moltbot nodes camera snap` defaults to `maxWidth=1600` unless overridden.
- On macOS, `camera.snap` waits `delayMs` (default 2000 ms) after warm-up/exposure settle before capturing.
- Photo payloads are recompressed to keep base64 under 5 MB.

## Safety + practical limits

- Camera and microphone access trigger the usual OS permission prompts (and require usage strings in Info.plist).
- Video clips are capped (currently `<= 60s`) to avoid oversized node payloads (base64 overhead + message limits).
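
The base64 overhead behind that cap is easy to estimate. A back-of-envelope sketch (the 1 Mbps bitrate is an assumption; actual encoder output varies):

```python
def base64_payload_mb(duration_s: float, bitrate_mbps: float) -> float:
    """Approximate base64-encoded clip size in MB (4/3 encoding overhead)."""
    raw_bytes = duration_s * bitrate_mbps * 1_000_000 / 8
    return raw_bytes * 4 / 3 / 1_000_000
```

At 1 Mbps, a 60 s clip is about 7.5 MB raw and roughly 10 MB once base64-encoded, which is why longer clips are not allowed.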

## macOS screen video (OS-level)

For *screen* video (not camera), use the macOS companion:

```bash
moltbot nodes screen record --node <id> --duration 10s --fps 15   # prints MEDIA:<path>
```

Notes:
- Requires macOS **Screen Recording** permission (TCC).
61
docker-compose/ez-assistant/docs/nodes/images.md
Normal file
@@ -0,0 +1,61 @@

---
summary: "Image and media handling rules for send, gateway, and agent replies"
read_when:
  - Modifying media pipeline or attachments
---

# Image & Media Support — 2025-12-05

The WhatsApp channel runs via **Baileys Web**. This document captures the current media handling rules for send, gateway, and agent replies.

## Goals
- Send media with optional captions via `moltbot message send --media`.
- Allow auto-replies from the web inbox to include media alongside text.
- Keep per-type limits sane and predictable.

## CLI Surface
- `moltbot message send --media <path-or-url> [--message <caption>]`
- `--media` is optional; the caption can be empty for media-only sends.
- `--dry-run` prints the resolved payload; `--json` emits `{ channel, to, messageId, mediaUrl, caption }`.

## WhatsApp Web channel behavior
- Input: local file path **or** HTTP(S) URL.
- Flow: load into a Buffer, detect media kind, and build the correct payload:
  - **Images:** resize & recompress to JPEG (max side 2048px) targeting `agents.defaults.mediaMaxMb` (default 5 MB), capped at 6 MB.
  - **Audio/Voice/Video:** pass-through up to 16 MB; audio is sent as a voice note (`ptt: true`).
  - **Documents:** anything else, up to 100 MB, with filename preserved when available.
- WhatsApp GIF-style playback: send an MP4 with `gifPlayback: true` (CLI: `--gif-playback`) so mobile clients loop inline.
- MIME detection prefers magic bytes, then headers, then file extension.
- Caption comes from `--message` or `reply.text`; an empty caption is allowed.
- Logging: non-verbose shows `↩️`/`✅`; verbose includes size and source path/URL.
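
The MIME detection precedence above can be sketched like this (an illustration of the ordering, not Moltbot's actual implementation; the magic-byte table is abbreviated):

```python
import mimetypes
from typing import Optional

# Abbreviated magic-byte table for illustration.
MAGIC_BYTES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF8": "image/gif",
}


def detect_mime(data: bytes, header_mime: Optional[str], filename: Optional[str]) -> str:
    """Precedence: magic bytes, then transport header, then file extension."""
    for magic, mime in MAGIC_BYTES.items():
        if data.startswith(magic):
            return mime
    if header_mime:
        return header_mime
    if filename:
        guessed, _ = mimetypes.guess_type(filename)
        if guessed:
            return guessed
    return "application/octet-stream"
```

Magic bytes win even when the header or extension disagrees, which keeps mislabeled uploads from being sent with the wrong payload type.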

## Auto-Reply Pipeline
- `getReplyFromConfig` returns `{ text?, mediaUrl?, mediaUrls? }`.
- When media is present, the web sender resolves local paths or URLs using the same pipeline as `moltbot message send`.
- Multiple media entries are sent sequentially if provided.

## Inbound Media to Commands (Pi)
- When inbound web messages include media, Moltbot downloads it to a temp file and exposes templating variables:
  - `{{MediaUrl}}` pseudo-URL for the inbound media.
  - `{{MediaPath}}` local temp path written before running the command.
- When a per-session Docker sandbox is enabled, inbound media is copied into the sandbox workspace and `MediaPath`/`MediaUrl` are rewritten to a relative path like `media/inbound/<filename>`.
- Media understanding (if configured via `tools.media.*` or shared `tools.media.models`) runs before templating and can insert `[Image]`, `[Audio]`, and `[Video]` blocks into `Body`.
  - Audio sets `{{Transcript}}` and uses the transcript for command parsing so slash commands still work.
  - Video and image descriptions preserve any caption text for command parsing.
  - By default only the first matching image/audio/video attachment is processed; set `tools.media.<cap>.attachments` to process multiple attachments.

## Limits & Errors
**Outbound send caps (WhatsApp web send)**
- Images: ~6 MB cap after recompression.
- Audio/voice/video: 16 MB cap; documents: 100 MB cap.
- Oversize or unreadable media → clear error in logs, and the reply is skipped.

**Media understanding caps (transcription/description)**
- Image default: 10 MB (`tools.media.image.maxBytes`).
- Audio default: 20 MB (`tools.media.audio.maxBytes`).
- Video default: 50 MB (`tools.media.video.maxBytes`).
- Oversize media skips understanding, but replies still go through with the original body.

## Notes for Tests
- Cover send + reply flows for image/audio/document cases.
- Validate recompression for images (size bound) and the voice-note flag for audio.
- Ensure multi-media replies fan out as sequential sends.
305
docker-compose/ez-assistant/docs/nodes/index.md
Normal file
@@ -0,0 +1,305 @@

---
summary: "Nodes: pairing, capabilities, permissions, and CLI helpers for canvas/camera/screen/system"
read_when:
  - Pairing iOS/Android nodes to a gateway
  - Using node canvas/camera for agent context
  - Adding new node commands or CLI helpers
---

# Nodes

A **node** is a companion device (macOS/iOS/Android/headless) that connects to the Gateway **WebSocket** (same port as operators) with `role: "node"` and exposes a command surface (e.g. `canvas.*`, `camera.*`, `system.*`) via `node.invoke`. Protocol details: [Gateway protocol](/gateway/protocol).

Legacy transport: [Bridge protocol](/gateway/bridge-protocol) (TCP JSONL; deprecated/removed for current nodes).

macOS can also run in **node mode**: the menubar app connects to the Gateway’s WS server and exposes its local canvas/camera commands as a node (so `moltbot nodes …` works against this Mac).

Notes:
- Nodes are **peripherals**, not gateways. They don’t run the gateway service.
- Telegram/WhatsApp/etc. messages land on the **gateway**, not on nodes.

## Pairing + status

**WS nodes use device pairing.** Nodes present a device identity during `connect`; the Gateway creates a device pairing request for `role: node`. Approve via the devices CLI (or UI).

Quick CLI:

```bash
moltbot devices list
moltbot devices approve <requestId>
moltbot devices reject <requestId>
moltbot nodes status
moltbot nodes describe --node <idOrNameOrIp>
```

Notes:
- `nodes status` marks a node as **paired** when its device pairing role includes `node`.
- `node.pair.*` (CLI: `moltbot nodes pending/approve/reject`) is a separate gateway-owned node pairing store; it does **not** gate the WS `connect` handshake.

## Remote node host (system.run)

Use a **node host** when your Gateway runs on one machine and you want commands to execute on another. The model still talks to the **gateway**; the gateway forwards `exec` calls to the **node host** when `host=node` is selected.

### What runs where
- **Gateway host**: receives messages, runs the model, routes tool calls.
- **Node host**: executes `system.run`/`system.which` on the node machine.
- **Approvals**: enforced on the node host via `~/.clawdbot/exec-approvals.json`.

### Start a node host (foreground)

On the node machine:

```bash
moltbot node run --host <gateway-host> --port 18789 --display-name "Build Node"
```

### Start a node host (service)

```bash
moltbot node install --host <gateway-host> --port 18789 --display-name "Build Node"
moltbot node restart
```

### Pair + name

On the gateway host:

```bash
moltbot nodes pending
moltbot nodes approve <requestId>
moltbot nodes list
```

Naming options:
- `--display-name` on `moltbot node run` / `moltbot node install` (persists in `~/.clawdbot/node.json` on the node).
- `moltbot nodes rename --node <id|name|ip> --name "Build Node"` (gateway override).

### Allowlist the commands

Exec approvals are **per node host**. Add allowlist entries from the gateway:

```bash
moltbot approvals allowlist add --node <id|name|ip> "/usr/bin/uname"
moltbot approvals allowlist add --node <id|name|ip> "/usr/bin/sw_vers"
```

Approvals live on the node host at `~/.clawdbot/exec-approvals.json`.

### Point exec at the node

Configure defaults (gateway config):

```bash
moltbot config set tools.exec.host node
moltbot config set tools.exec.security allowlist
moltbot config set tools.exec.node "<id-or-name>"
```

Or per session:

```
/exec host=node security=allowlist node=<id-or-name>
```

Once set, any `exec` call with `host=node` runs on the node host (subject to the node allowlist/approvals).

Related:
- [Node host CLI](/cli/node)
- [Exec tool](/tools/exec)
- [Exec approvals](/tools/exec-approvals)

## Invoking commands

Low-level (raw RPC):

```bash
moltbot nodes invoke --node <idOrNameOrIp> --command canvas.eval --params '{"javaScript":"location.href"}'
```

Higher-level helpers exist for the common “give the agent a MEDIA attachment” workflows.

## Screenshots (canvas snapshots)

If the node is showing the Canvas (WebView), `canvas.snapshot` returns `{ format, base64 }`.

CLI helper (writes to a temp file and prints `MEDIA:<path>`):

```bash
moltbot nodes canvas snapshot --node <idOrNameOrIp> --format png
moltbot nodes canvas snapshot --node <idOrNameOrIp> --format jpg --max-width 1200 --quality 0.9
```
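
The `MEDIA:<path>` contract makes helper output easy to consume programmatically. A sketch (assumes the helper's stdout format described above):

```python
def media_paths(cli_stdout: str) -> list:
    """Extract file paths from stdout lines of the form 'MEDIA:<path>'."""
    return [
        line[len("MEDIA:"):].strip()
        for line in cli_stdout.splitlines()
        if line.startswith("MEDIA:")
    ]
```

Other log lines pass through untouched, so a wrapper can feed only the extracted paths to whatever consumes the attachments.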

### Canvas controls

```bash
moltbot nodes canvas present --node <idOrNameOrIp> --target https://example.com
moltbot nodes canvas hide --node <idOrNameOrIp>
moltbot nodes canvas navigate https://example.com --node <idOrNameOrIp>
moltbot nodes canvas eval --node <idOrNameOrIp> --js "document.title"
```

Notes:
- `canvas present` accepts URLs or local file paths (`--target`), plus optional `--x/--y/--width/--height` for positioning.
- `canvas eval` accepts inline JS (`--js`) or a positional arg.

### A2UI (Canvas)

```bash
moltbot nodes canvas a2ui push --node <idOrNameOrIp> --text "Hello"
moltbot nodes canvas a2ui push --node <idOrNameOrIp> --jsonl ./payload.jsonl
moltbot nodes canvas a2ui reset --node <idOrNameOrIp>
```

Notes:
- Only A2UI v0.8 JSONL is supported (v0.9/createSurface is rejected).

## Photos + videos (node camera)

Photos (`jpg`):

```bash
moltbot nodes camera list --node <idOrNameOrIp>
moltbot nodes camera snap --node <idOrNameOrIp>            # default: both facings (2 MEDIA lines)
moltbot nodes camera snap --node <idOrNameOrIp> --facing front
```

Video clips (`mp4`):

```bash
moltbot nodes camera clip --node <idOrNameOrIp> --duration 10s
moltbot nodes camera clip --node <idOrNameOrIp> --duration 3000 --no-audio
```

Notes:
- The node must be **foregrounded** for `canvas.*` and `camera.*` (background calls return `NODE_BACKGROUND_UNAVAILABLE`).
- Clip duration is clamped (currently `<= 60s`) to avoid oversized base64 payloads.
- Android will prompt for `CAMERA`/`RECORD_AUDIO` permissions when possible; denied permissions fail with `*_PERMISSION_REQUIRED`.

## Screen recordings (nodes)

Nodes expose `screen.record` (mp4). Example:

```bash
moltbot nodes screen record --node <idOrNameOrIp> --duration 10s --fps 10
moltbot nodes screen record --node <idOrNameOrIp> --duration 10s --fps 10 --no-audio
```

Notes:
- `screen.record` requires the node app to be foregrounded.
- Android will show the system screen-capture prompt before recording.
- Screen recordings are clamped to `<= 60s`.
- `--no-audio` disables microphone capture (supported on iOS/Android; macOS uses system capture audio).
- Use `--screen <index>` to select a display when multiple screens are available.

## Location (nodes)

Nodes expose `location.get` when Location is enabled in settings.

CLI helper:

```bash
moltbot nodes location get --node <idOrNameOrIp>
moltbot nodes location get --node <idOrNameOrIp> --accuracy precise --max-age 15000 --location-timeout 10000
```

Notes:
- Location is **off by default**.
- “Always” requires system permission; background fetch is best-effort.
- The response includes lat/lon, accuracy (meters), and timestamp.

## SMS (Android nodes)

Android nodes can expose `sms.send` when the user grants **SMS** permission and the device supports telephony.

Low-level invoke:

```bash
moltbot nodes invoke --node <idOrNameOrIp> --command sms.send --params '{"to":"+15555550123","message":"Hello from Moltbot"}'
```

Notes:
- The permission prompt must be accepted on the Android device before the capability is advertised.
- Wi-Fi-only devices without telephony will not advertise `sms.send`.

## System commands (node host / mac node)

The macOS node exposes `system.run`, `system.notify`, and `system.execApprovals.get/set`. The headless node host exposes `system.run`, `system.which`, and `system.execApprovals.get/set`.

Examples:

```bash
moltbot nodes run --node <idOrNameOrIp> -- echo "Hello from mac node"
moltbot nodes notify --node <idOrNameOrIp> --title "Ping" --body "Gateway ready"
```

Notes:
- `system.run` returns stdout/stderr/exit code in the payload.
- `system.notify` respects notification permission state on the macOS app.
- `system.run` supports `--cwd`, `--env KEY=VAL`, `--command-timeout`, and `--needs-screen-recording`.
- `system.notify` supports `--priority <passive|active|timeSensitive>` and `--delivery <system|overlay|auto>`.
- macOS nodes drop `PATH` overrides; headless node hosts only accept `PATH` when it prepends the node host PATH.
- In macOS node mode, `system.run` is gated by exec approvals in the macOS app (Settings → Exec approvals). Ask/allowlist/full behave the same as on the headless node host; denied prompts return `SYSTEM_RUN_DENIED`.
- On the headless node host, `system.run` is gated by exec approvals (`~/.clawdbot/exec-approvals.json`).

## Exec node binding

When multiple nodes are available, you can bind exec to a specific node. This sets the default node for `exec host=node` (and can be overridden per agent).

Global default:

```bash
moltbot config set tools.exec.node "node-id-or-name"
```

Per-agent override:

```bash
moltbot config get agents.list
moltbot config set agents.list[0].tools.exec.node "node-id-or-name"
```

Unset to allow any node:

```bash
moltbot config unset tools.exec.node
moltbot config unset agents.list[0].tools.exec.node
```

## Permissions map

Nodes may include a `permissions` map in `node.list` / `node.describe`, keyed by permission name (e.g. `screenRecording`, `accessibility`) with boolean values (`true` = granted).
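
A caller can gate on that map before invoking a command. A sketch (key names follow the examples above; a missing key is treated as not granted):

```python
def missing_permissions(node: dict, required: list) -> list:
    """Return the required permission keys that are absent or not granted."""
    granted = node.get("permissions", {})
    return [key for key in required if not granted.get(key, False)]
```

If the returned list is non-empty, prompt the user to grant those permissions on the device instead of letting the invoke fail.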

## Headless node host (cross-platform)

Moltbot can run a **headless node host** (no UI) that connects to the Gateway WebSocket and exposes `system.run` / `system.which`. This is useful on Linux/Windows or for running a minimal node alongside a server.

Start it:

```bash
moltbot node run --host <gateway-host> --port 18789
```

Notes:
- Pairing is still required (the Gateway will show a node approval prompt).
- The node host stores its node id, token, display name, and gateway connection info in `~/.clawdbot/node.json`.
- Exec approvals are enforced locally via `~/.clawdbot/exec-approvals.json` (see [Exec approvals](/tools/exec-approvals)).
- On macOS, the headless node host prefers the companion app exec host when reachable and falls back to local execution if the app is unavailable. Set `CLAWDBOT_NODE_EXEC_HOST=app` to require the app, or `CLAWDBOT_NODE_EXEC_FALLBACK=0` to disable fallback.
- Add `--tls` / `--tls-fingerprint` when the Gateway WS uses TLS.

## Mac node mode

- The macOS menubar app connects to the Gateway WS server as a node (so `moltbot nodes …` works against this Mac).
- In remote mode, the app opens an SSH tunnel for the Gateway port and connects to `localhost`.
95
docker-compose/ez-assistant/docs/nodes/location-command.md
Normal file
@@ -0,0 +1,95 @@

---
summary: "Location command for nodes (location.get), permission modes, and background behavior"
read_when:
  - Adding location node support or permissions UI
  - Designing background location + push flows
---

# Location command (nodes)

## TL;DR
- `location.get` is a node command (via `node.invoke`).
- Off by default.
- Settings use a selector: Off / While Using / Always.
- Separate toggle: Precise Location.

## Why a selector (not just a switch)
OS permissions are multi-level. We can expose a selector in-app, but the OS still decides the actual grant.
- iOS/macOS: the user can choose **While Using** or **Always** in system prompts/Settings. The app can request an upgrade, but the OS may require a trip to Settings.
- Android: background location is a separate permission; on Android 10+ it often requires a Settings flow.
- Precise location is a separate grant (iOS 14+ “Precise”, Android “fine” vs “coarse”).

The selector in the UI drives the requested mode; the actual grant lives in OS settings.

## Settings model
Per node device:
- `location.enabledMode`: `off | whileUsing | always`
- `location.preciseEnabled`: bool

UI behavior:
- Selecting `whileUsing` requests foreground permission.
- Selecting `always` first ensures `whileUsing`, then requests background (or sends the user to Settings if required).
- If the OS denies the requested level, revert to the highest granted level and show status.

## Permissions mapping (node.permissions)
Optional. The macOS node reports `location` via the permissions map; iOS/Android may omit it.

## Command: `location.get`
Called via `node.invoke`.

Params (suggested):
```json
{
  "timeoutMs": 10000,
  "maxAgeMs": 15000,
  "desiredAccuracy": "coarse|balanced|precise"
}
```

Response payload:
```json
{
  "lat": 48.20849,
  "lon": 16.37208,
  "accuracyMeters": 12.5,
  "altitudeMeters": 182.0,
  "speedMps": 0.0,
  "headingDeg": 270.0,
  "timestamp": "2026-01-03T12:34:56.000Z",
  "isPrecise": true,
  "source": "gps|wifi|cell|unknown"
}
```

Errors (stable codes):
- `LOCATION_DISABLED`: selector is off.
- `LOCATION_PERMISSION_REQUIRED`: permission missing for the requested mode.
- `LOCATION_BACKGROUND_UNAVAILABLE`: app is backgrounded but only While Using is allowed.
- `LOCATION_TIMEOUT`: no fix in time.
- `LOCATION_UNAVAILABLE`: system failure / no providers.
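
Because the codes are stable, a caller can sort them into transient failures worth retrying versus states that need user action. A sketch (the grouping is a suggestion, not part of the command contract):

```python
# Transient: a later attempt may succeed without any user action.
RETRYABLE = {"LOCATION_TIMEOUT", "LOCATION_UNAVAILABLE"}
# Needs a settings/permission change by the user first.
NEEDS_USER_ACTION = {
    "LOCATION_DISABLED",
    "LOCATION_PERMISSION_REQUIRED",
    "LOCATION_BACKGROUND_UNAVAILABLE",
}


def should_retry(error_code: str) -> bool:
    """True when retrying location.get may succeed without user intervention."""
    return error_code in RETRYABLE
```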

## Background behavior (future)
Goal: the model can request location even when the node is backgrounded, but only when:
- The user selected **Always**.
- The OS grants background location.
- The app is allowed to run in the background for location (iOS background mode / Android foreground service or special allowance).

Push-triggered flow (future):
1) Gateway sends a push to the node (silent push or FCM data).
2) Node wakes briefly and requests location from the device.
3) Node forwards the payload to the Gateway.

Notes:
- iOS: Always permission + background location mode required. Silent push may be throttled; expect intermittent failures.
- Android: background location may require a foreground service; otherwise, expect denial.

## Model/tooling integration
- Tool surface: the `nodes` tool adds a `location_get` action (node required).
- CLI: `moltbot nodes location get --node <id>`.
- Agent guidelines: only call when the user has enabled location and understands the scope.

## UX copy (suggested)
- Off: “Location sharing is disabled.”
- While Using: “Only when Moltbot is open.”
- Always: “Allow background location. Requires system permission.”
- Precise: “Use precise GPS location. Toggle off to share approximate location.”
313
docker-compose/ez-assistant/docs/nodes/media-understanding.md
Normal file
@@ -0,0 +1,313 @@
---
summary: "Inbound image/audio/video understanding (optional) with provider + CLI fallbacks"
read_when:
- Designing or refactoring media understanding
- Tuning inbound audio/video/image preprocessing
---
# Media Understanding (Inbound) — 2026-01-17

Moltbot can **summarize inbound media** (image/audio/video) before the reply pipeline runs. It auto-detects when local tools or provider keys are available, and can be disabled or customized. If understanding is off, models still receive the original files/URLs as usual.

## Goals
- Optional: pre-digest inbound media into short text for faster routing + better command parsing.
- Preserve original media delivery to the model (always).
- Support **provider APIs** and **CLI fallbacks**.
- Allow multiple models with ordered fallback (error/size/timeout).

## High-level behavior
1) Collect inbound attachments (`MediaPaths`, `MediaUrls`, `MediaTypes`).
2) For each enabled capability (image/audio/video), select attachments per policy (default: **first**).
3) Choose the first eligible model entry (size + capability + auth).
4) If a model fails or the media is too large, **fall back to the next entry**.
5) On success:
   - `Body` becomes an `[Image]`, `[Audio]`, or `[Video]` block.
   - Audio sets `{{Transcript}}`; command parsing uses the caption text when present, otherwise the transcript.
   - Captions are preserved as `User text:` inside the block.

If understanding fails or is disabled, **the reply flow continues** with the original body + attachments.
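
The selection-and-fallback loop (steps 3–4) can be sketched as follows. The entry shape mirrors the config keys documented below; `run` stands in for the actual provider/CLI call, so this is an illustration of the ordering rules, not the shipping implementation:

```python
# Sketch of ordered model fallback: skip capability-gated or oversized
# entries, try the rest in order, and fall through on any failure.
def summarize(media_bytes, capability, models, run):
    for entry in models:
        caps = entry.get("capabilities")
        if caps and capability not in caps:
            continue  # entry is gated to other media types
        max_bytes = entry.get("maxBytes")
        if max_bytes is not None and media_bytes > max_bytes:
            continue  # too large for this entry: try the next one
        try:
            text = run(entry)
        except Exception:
            continue  # provider/CLI error: fall back
        max_chars = entry.get("maxChars")
        return text[:max_chars] if max_chars else text
    return None  # understanding failed; the reply flow continues unchanged

models = [
    {"provider": "openai", "maxBytes": 10, "capabilities": ["image"]},
    {"provider": "google", "capabilities": ["image", "video"]},
]
print(summarize(20, "image", models, lambda e: "described by " + e["provider"]))
```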

## Config overview
`tools.media` supports **shared models** plus per-capability overrides:
- `tools.media.models`: shared model list (use `capabilities` to gate).
- `tools.media.image` / `tools.media.audio` / `tools.media.video`:
  - defaults (`prompt`, `maxChars`, `maxBytes`, `timeoutSeconds`, `language`)
  - provider overrides (`baseUrl`, `headers`, `providerOptions`)
    - Deepgram audio options via `tools.media.audio.providerOptions.deepgram`
  - optional **per-capability `models` list** (preferred before shared models)
  - `attachments` policy (`mode`, `maxAttachments`, `prefer`)
  - `scope` (optional gating by channel/chatType/session key)
- `tools.media.concurrency`: max concurrent capability runs (default **2**).

```json5
{
  tools: {
    media: {
      models: [ /* shared list */ ],
      image: { /* optional overrides */ },
      audio: { /* optional overrides */ },
      video: { /* optional overrides */ }
    }
  }
}
```

### Model entries
Each `models[]` entry can be **provider** or **CLI**:

```json5
{
  type: "provider", // default if omitted
  provider: "openai",
  model: "gpt-5.2",
  prompt: "Describe the image in <= 500 chars.",
  maxChars: 500,
  maxBytes: 10485760,
  timeoutSeconds: 60,
  capabilities: ["image"], // optional, used for multi-modal entries
  profile: "vision-profile",
  preferredProfile: "vision-fallback"
}
```

```json5
{
  type: "cli",
  command: "gemini",
  args: [
    "-m",
    "gemini-3-flash",
    "--allowed-tools",
    "read_file",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
  ],
  maxChars: 500,
  maxBytes: 52428800,
  timeoutSeconds: 120,
  capabilities: ["video", "image"]
}
```

CLI templates can also use:
- `{{MediaDir}}` (directory containing the media file)
- `{{OutputDir}}` (scratch dir created for this run)
- `{{OutputBase}}` (scratch file base path, no extension)
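
Expanding these `{{…}}` placeholders into concrete CLI args amounts to a simple substitution pass over the arg templates. A minimal sketch (the expansion mechanics are an assumption for illustration; only the variable names are documented):

```python
# Sketch of {{Name}} placeholder expansion over CLI arg templates.
def expand_args(args, variables):
    out = []
    for arg in args:
        for name, value in variables.items():
            arg = arg.replace("{{" + name + "}}", str(value))
        out.append(arg)
    return out

print(expand_args(
    ["Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."],
    {"MediaPath": "/tmp/note.ogg", "MaxChars": 500},
))
```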

## Defaults and limits
Recommended defaults:
- `maxChars`: **500** for image/video (short, command-friendly)
- `maxChars`: **unset** for audio (full transcript unless you set a limit)
- `maxBytes`:
  - image: **10MB**
  - audio: **20MB**
  - video: **50MB**

Rules:
- If media exceeds `maxBytes`, that model is skipped and the **next model is tried**.
- If the model returns more than `maxChars`, output is trimmed.
- `prompt` defaults to a simple “Describe the {media}.” plus the `maxChars` guidance (image/video only).
- If `<capability>.enabled: true` but no models are configured, Moltbot tries the **active reply model** when its provider supports the capability.

### Auto-detect media understanding (default)
If `tools.media.<capability>.enabled` is **not** set to `false` and you haven’t configured models, Moltbot auto-detects in this order and **stops at the first working option**:

1) **Local CLIs** (audio only; if installed)
   - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
   - `whisper-cli` (`whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
   - `whisper` (Python CLI; downloads models automatically)
2) **Gemini CLI** (`gemini`) using `read_many_files`
3) **Provider keys**
   - Audio: OpenAI → Groq → Deepgram → Google
   - Image: OpenAI → Anthropic → Google → MiniMax
   - Video: Google

To disable auto-detection, set:
```json5
{
  tools: {
    media: {
      audio: {
        enabled: false
      }
    }
  }
}
```
Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
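
A rough sketch of this PATH-based probing, following the documented audio order. The env-var names and the returned string encoding are assumptions for illustration, not Moltbot's actual implementation; the `which`/`env` parameters exist only so the sketch is testable:

```python
import os
import shutil

# Documented audio probe order: local CLIs first, then the gemini CLI,
# then provider keys. Env-var names below are assumptions.
CLI_CANDIDATES = ["sherpa-onnx-offline", "whisper-cli", "whisper", "gemini"]
PROVIDER_KEYS = ["OPENAI_API_KEY", "GROQ_API_KEY", "DEEPGRAM_API_KEY", "GOOGLE_API_KEY"]

def autodetect_audio(which=shutil.which, env=os.environ):
    for cli in CLI_CANDIDATES:
        if which(os.path.expanduser(cli)):  # best-effort PATH lookup; `~` expanded
            return "cli:" + cli
    for key in PROVIDER_KEYS:
        if env.get(key):
            return "provider:" + key
    return None  # nothing found: the capability stays off
```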

## Capabilities (optional)
If you set `capabilities`, the entry only runs for those media types. For shared lists, Moltbot can infer defaults:
- `openai`, `anthropic`, `minimax`: **image**
- `google` (Gemini API): **image + audio + video**
- `groq`: **audio**
- `deepgram`: **audio**

For CLI entries, **set `capabilities` explicitly** to avoid surprising matches. If you omit `capabilities`, the entry is eligible for the list it appears in.

## Provider support matrix (Moltbot integrations)
| Capability | Provider integration | Notes |
|------------|----------------------|-------|
| Image | OpenAI / Anthropic / Google / others via `pi-ai` | Any image-capable model in the registry works. |
| Audio | OpenAI, Groq, Deepgram, Google | Provider transcription (Whisper/Deepgram/Gemini). |
| Video | Google (Gemini API) | Provider video understanding. |

## Recommended providers
**Image**
- Prefer your active model if it supports images.
- Good defaults: `openai/gpt-5.2`, `anthropic/claude-opus-4-5`, `google/gemini-3-pro-preview`.

**Audio**
- `openai/gpt-4o-mini-transcribe`, `groq/whisper-large-v3-turbo`, or `deepgram/nova-3`.
- CLI fallback: `whisper-cli` (whisper-cpp) or `whisper`.
- Deepgram setup: [Deepgram (audio transcription)](/providers/deepgram).

**Video**
- `google/gemini-3-flash-preview` (fast), `google/gemini-3-pro-preview` (richer).
- CLI fallback: the `gemini` CLI (supports `read_file` on video/audio).

## Attachment policy
Per-capability `attachments` controls which attachments are processed:
- `mode`: `first` (default) or `all`
- `maxAttachments`: cap the number processed (default **1**)
- `prefer`: `first`, `last`, `path`, `url`

When `mode: "all"`, outputs are labeled `[Image 1/2]`, `[Audio 2/2]`, etc.
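
The `[Image 1/2]`-style labeling can be sketched as a small formatter. The exact formatting is inferred from the examples above (including the assumption that a single attachment gets a plain `[Image]` label):

```python
# Sketch of per-attachment output labeling under mode: "all".
def label_blocks(kind, summaries):
    """Label outputs like [Image 1/2] when multiple attachments were processed."""
    n = len(summaries)
    if n == 1:
        return ["[" + kind + "] " + summaries[0]]
    return [f"[{kind} {i}/{n}] {s}" for i, s in enumerate(summaries, start=1)]

print(label_blocks("Audio", ["first transcript", "second transcript"]))
```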

## Config examples

### 1) Shared models list + overrides
```json5
{
  tools: {
    media: {
      models: [
        { provider: "openai", model: "gpt-5.2", capabilities: ["image"] },
        { provider: "google", model: "gemini-3-flash-preview", capabilities: ["image", "audio", "video"] },
        {
          type: "cli",
          command: "gemini",
          args: [
            "-m",
            "gemini-3-flash",
            "--allowed-tools",
            "read_file",
            "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
          ],
          capabilities: ["image", "video"]
        }
      ],
      audio: {
        attachments: { mode: "all", maxAttachments: 2 }
      },
      video: {
        maxChars: 500
      }
    }
  }
}
```

### 2) Audio + Video only (image off)
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"]
          }
        ]
      },
      video: {
        enabled: true,
        maxChars: 500,
        models: [
          { provider: "google", model: "gemini-3-flash-preview" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
            ]
          }
        ]
      }
    }
  }
}
```

### 3) Optional image understanding
```json5
{
  tools: {
    media: {
      image: {
        enabled: true,
        maxBytes: 10485760,
        maxChars: 500,
        models: [
          { provider: "openai", model: "gpt-5.2" },
          { provider: "anthropic", model: "claude-opus-4-5" },
          {
            type: "cli",
            command: "gemini",
            args: [
              "-m",
              "gemini-3-flash",
              "--allowed-tools",
              "read_file",
              "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters."
            ]
          }
        ]
      }
    }
  }
}
```

### 4) Multi-modal single entry (explicit capabilities)
```json5
{
  tools: {
    media: {
      image: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] },
      audio: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] },
      video: { models: [{ provider: "google", model: "gemini-3-pro-preview", capabilities: ["image", "video", "audio"] }] }
    }
  }
}
```

## Status output
When media understanding runs, `/status` includes a short summary line:

```
📎 Media: image ok (openai/gpt-5.2) · audio skipped (maxBytes)
```

This shows per-capability outcomes and the chosen provider/model when applicable.

## Notes
- Understanding is **best-effort**. Errors do not block replies.
- Attachments are still passed to models even when understanding is disabled.
- Use `scope` to limit where understanding runs (e.g. only DMs).

## Related docs
- [Configuration](/gateway/configuration)
- [Image & Media Support](/nodes/images)
79
docker-compose/ez-assistant/docs/nodes/talk.md
Normal file
@@ -0,0 +1,79 @@
---
summary: "Talk mode: continuous speech conversations with ElevenLabs TTS"
read_when:
- Implementing Talk mode on macOS/iOS/Android
- Changing voice/TTS/interrupt behavior
---
# Talk Mode

Talk mode is a continuous voice conversation loop:
1) Listen for speech
2) Send the transcript to the model (main session, `chat.send`)
3) Wait for the response
4) Speak it via ElevenLabs (streaming playback)

## Behavior (macOS)
- **Always-on overlay** while Talk mode is enabled.
- **Listening → Thinking → Speaking** phase transitions.
- On a **short pause** (silence window), the current transcript is sent.
- Replies are **written to WebChat** (same as typing).
- **Interrupt on speech** (default on): if the user starts talking while the assistant is speaking, we stop playback and note the interruption timestamp for the next prompt.

## Voice directives in replies
The assistant may prefix its reply with a **single JSON line** to control voice:

```json
{"voice":"<voice-id>","once":true}
```

Rules:
- First non-empty line only.
- Unknown keys are ignored.
- `once: true` applies to the current reply only.
- Without `once`, the voice becomes the new default for Talk mode.
- The JSON line is stripped before TTS playback.

Supported keys:
- `voice` / `voice_id` / `voiceId`
- `model` / `model_id` / `modelId`
- `speed`, `rate` (WPM), `stability`, `similarity`, `style`, `speakerBoost`
- `seed`, `normalize`, `lang`, `output_format`, `latency_tier`
- `once`
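
The directive rules above can be sketched as a small parser: take the first non-empty line, accept it only if it is a JSON object, and strip it from the spoken body. A sketch of the documented rules, not the shipping implementation (key handling is left to the caller, which ignores unknown keys):

```python
import json

def parse_voice_directive(reply):
    """Extract an optional leading one-line JSON voice directive and strip it."""
    lines = reply.splitlines()
    for i, line in enumerate(lines):
        if not line.strip():
            continue  # the directive must be the first non-empty line
        try:
            directive = json.loads(line)
        except json.JSONDecodeError:
            return None, reply  # no directive: speak the reply as-is
        if not isinstance(directive, dict):
            return None, reply
        rest = "\n".join(lines[:i] + lines[i + 1:]).lstrip("\n")
        return directive, rest  # caller ignores unknown keys
    return None, reply

directive, body = parse_voice_directive('{"voice":"abc","once":true}\nHello there!')
print(directive, "->", body)
```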

## Config (`~/.clawdbot/moltbot.json`)
```json5
{
  "talk": {
    "voiceId": "elevenlabs_voice_id",
    "modelId": "eleven_v3",
    "outputFormat": "mp3_44100_128",
    "apiKey": "elevenlabs_api_key",
    "interruptOnSpeech": true
  }
}
```

Defaults:
- `interruptOnSpeech`: true
- `voiceId`: falls back to `ELEVENLABS_VOICE_ID` / `SAG_VOICE_ID` (or the first ElevenLabs voice when an API key is available)
- `modelId`: defaults to `eleven_v3` when unset
- `apiKey`: falls back to `ELEVENLABS_API_KEY` (or the gateway shell profile if available)
- `outputFormat`: defaults to `pcm_44100` on macOS/iOS and `pcm_24000` on Android (set `mp3_*` to force MP3 streaming)

## macOS UI
- Menu bar toggle: **Talk**
- Config tab: **Talk Mode** group (voice id + interrupt toggle)
- Overlay:
  - **Listening**: cloud pulses with mic level
  - **Thinking**: sinking animation
  - **Speaking**: radiating rings
  - Click the cloud: stop speaking
  - Click the X: exit Talk mode

## Notes
- Requires Speech + Microphone permissions.
- Uses `chat.send` against session key `main`.
- TTS uses the ElevenLabs streaming API with `ELEVENLABS_API_KEY` and incremental playback on macOS/iOS/Android for lower latency.
- `stability` for `eleven_v3` is validated to `0.0`, `0.5`, or `1.0`; other models accept `0..1`.
- `latency_tier` is validated to `0..4` when set.
- Android supports `pcm_16000`, `pcm_22050`, `pcm_24000`, and `pcm_44100` output formats for low-latency AudioTrack streaming.
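
The two validation notes above can be sketched as follows. Raising on invalid values is an assumption for illustration; the shipping code may clamp or warn instead:

```python
# Sketch of the documented validation rules for stability and latency_tier.
V3_STABILITY = {0.0, 0.5, 1.0}

def validate_stability(value, model_id):
    if model_id == "eleven_v3":
        if value not in V3_STABILITY:
            raise ValueError("eleven_v3 stability must be 0.0, 0.5, or 1.0")
        return value
    if not 0.0 <= value <= 1.0:
        raise ValueError("stability must be within 0..1")
    return value

def validate_latency_tier(tier):
    if not 0 <= tier <= 4:
        raise ValueError("latency_tier must be within 0..4")
    return tier
```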
61
docker-compose/ez-assistant/docs/nodes/voicewake.md
Normal file
@@ -0,0 +1,61 @@
---
summary: "Global voice wake words (Gateway-owned) and how they sync across nodes"
read_when:
- Changing voice wake words behavior or defaults
- Adding new node platforms that need wake word sync
---
# Voice Wake (Global Wake Words)

Moltbot treats **wake words as a single global list** owned by the **Gateway**.

- There are **no per-node custom wake words**.
- **Any node/app UI may edit** the list; changes are persisted by the Gateway and broadcast to everyone.
- Each device still keeps its own **Voice Wake enabled/disabled** toggle (local UX + permissions differ).

## Storage (Gateway host)

Wake words are stored on the gateway machine at:

- `~/.clawdbot/settings/voicewake.json`

Shape:

```json
{ "triggers": ["clawd", "claude", "computer"], "updatedAtMs": 1730000000000 }
```

## Protocol

### Methods

- `voicewake.get` → `{ triggers: string[] }`
- `voicewake.set` with params `{ triggers: string[] }` → `{ triggers: string[] }`

Notes:
- Triggers are normalized (trimmed, empties dropped). Empty lists fall back to defaults.
- Limits are enforced for safety (count/length caps).
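
The normalization notes can be sketched as a small helper. Only trim/drop-empties/fall-back-to-defaults is documented; the specific caps and the truncation behavior below are assumptions for illustration:

```python
# Sketch of voicewake.set trigger normalization.
DEFAULT_TRIGGERS = ["clawd", "claude", "computer"]  # from the storage example above
MAX_TRIGGERS, MAX_LEN = 16, 64  # assumed safety caps, for illustration only

def normalize_triggers(triggers):
    cleaned = [t.strip()[:MAX_LEN] for t in triggers]
    cleaned = [t for t in cleaned if t][:MAX_TRIGGERS]
    return cleaned or list(DEFAULT_TRIGGERS)  # empty list falls back to defaults

print(normalize_triggers(["  hey moltbot  ", "", "ok"]))
print(normalize_triggers(["   "]))
```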

### Events

- `voicewake.changed` with payload `{ triggers: string[] }`

Who receives it:
- All WebSocket clients (macOS app, WebChat, etc.)
- All connected nodes (iOS/Android), and also on node connect as an initial “current state” push.

## Client behavior

### macOS app

- Uses the global list to gate `VoiceWakeRuntime` triggers.
- Editing “Trigger words” in Voice Wake settings calls `voicewake.set` and then relies on the broadcast to keep other clients in sync.

### iOS node

- Uses the global list for `VoiceWakeManager` trigger detection.
- Editing Wake Words in Settings calls `voicewake.set` (over the Gateway WS) and also keeps local wake-word detection responsive.

### Android node

- Exposes a Wake Words editor in Settings.
- Calls `voicewake.set` over the Gateway WS so edits sync everywhere.