Add ez-assistant and kerberos service folders

2026-02-11 14:56:03 -05:00
parent e4e8ae1b87
commit 9ccfb36923
4471 changed files with 746463 additions and 0 deletions
--- a/docker-compose/ez-assistant/docs/refactor/clawnet.md
+++ b/docker-compose/ez-assistant/docs/refactor/clawnet.md
@@ -0,0 +1,360 @@
+---
+summary: "Clawnet refactor: unify network protocol, roles, auth, approvals, identity"
+read_when:
+  - Planning a unified network protocol for nodes + operator clients
+  - Reworking approvals, pairing, TLS, and presence across devices
+---
+# Clawnet refactor (protocol + auth unification)
+
+## Hi
+Hi Peter — great direction; this unlocks simpler UX + stronger security.
+
+## Purpose
+Single, rigorous document for:
+- Current state: protocols, flows, trust boundaries.
+- Pain points: approvals, multi‑hop routing, UI duplication.
+- Proposed new state: one protocol, scoped roles, unified auth/pairing, TLS pinning.
+- Identity model: stable IDs + cute slugs.
+- Migration plan, risks, open questions.
+
+## Goals (from discussion)
+- One protocol for all clients (mac app, CLI, iOS, Android, headless node).
+- Every network participant authenticated + paired.
+- Role clarity: nodes vs operators.
+- Central approvals routed to where the user is.
+- TLS encryption + optional pinning for all remote traffic.
+- Minimal code duplication.
+- Single machine should appear once (no UI/node duplicate entry).
+
+## Non‑goals (explicit)
+- Remove capability separation (still need least‑privilege).
+- Expose full gateway control plane without scope checks.
+- Make auth depend on human labels (slugs remain non‑security).
+
+---
+
+# Current state (as‑is)
+
+## Two protocols
+
+### 1) Gateway WebSocket (control plane)
+- Full API surface: config, channels, models, sessions, agent runs, logs, nodes, etc.
+- Default bind: loopback. Remote access via SSH/Tailscale.
+- Auth: token/password via `connect`.
+- No TLS pinning (relies on loopback/tunnel).
+- Code:
+  - `src/gateway/server/ws-connection/message-handler.ts`
+  - `src/gateway/client.ts`
+  - `docs/gateway/protocol.md`
+
+### 2) Bridge (node transport)
+- Narrow allowlist surface, node identity + pairing.
+- JSONL over TCP; optional TLS + cert fingerprint pinning.
+- TLS advertises fingerprint in discovery TXT.
+- Code:
+  - `src/infra/bridge/server/connection.ts`
+  - `src/gateway/server-bridge.ts`
+  - `src/node-host/bridge-client.ts`
+  - `docs/gateway/bridge-protocol.md`
+
+## Control plane clients today
+- CLI → Gateway WS via `callGateway` (`src/gateway/call.ts`).
+- macOS app UI → Gateway WS (`GatewayConnection`).
+- Web Control UI → Gateway WS.
+- ACP → Gateway WS.
+- Browser control uses its own HTTP control server.
+
+## Nodes today
+- macOS app in node mode connects to Gateway bridge (`MacNodeBridgeSession`).
+- iOS/Android apps connect to Gateway bridge.
+- Pairing + per‑node token stored on gateway.
+
+## Current approval flow (exec)
+- Agent uses `system.run` via Gateway.
+- Gateway invokes node over bridge.
+- Node runtime decides approval.
+- UI prompt shown by mac app (when node == mac app).
+- Node returns `invoke-res` to Gateway.
+- Net effect: the flow is multi‑hop, and the prompt UI is tied to the node host.
+
+## Presence + identity today
+- Gateway presence entries from WS clients.
+- Node presence entries from bridge.
+- mac app can show two entries for same machine (UI + node).
+- Node identity stored in pairing store; UI identity separate.
+
+---
+
+# Problems / pain points
+
+- Two protocol stacks to maintain (WS + Bridge).
+- Approvals on remote nodes: prompt appears on node host, not where user is.
+- TLS pinning only exists for bridge; WS depends on SSH/Tailscale.
+- Identity duplication: same machine shows as multiple instances.
+- Ambiguous roles: UI + node + CLI capabilities not clearly separated.
+
+---
+
+# Proposed new state (Clawnet)
+
+## One protocol, two roles
+Single WS protocol with role + scope.
+- **Role: node** (capability host)
+- **Role: operator** (control plane)
+- Optional **scope** for operator:
+  - `operator.read` (status + viewing)
+  - `operator.write` (agent run, sends)
+  - `operator.admin` (config, channels, models)
+
+### Role behaviors
+
+**Node**
+- Can register capabilities (`caps`, `commands`, permissions).
+- Can receive `invoke` commands (`system.run`, `camera.*`, `canvas.*`, `screen.record`, etc.).
+- Can send events: `voice.transcript`, `agent.request`, `chat.subscribe`.
+- Cannot call config/models/channels/sessions/agent control plane APIs.
+
+**Operator**
+- Full control plane API, gated by scope.
+- Receives all approval requests.
+- Does not directly execute OS actions; routes to nodes.
+
+### Key rule
+Role is per‑connection, not per device. A device may open both roles, separately.
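A minimal sketch of the per-connection gating described above. The method names in `requiredScope`, the `Connection` shape, and the choice to treat scopes as independent (admin does not imply read) are assumptions for illustration; the doc only fixes the role names, the three scope labels, and the node event names.

```ts
// Hypothetical shapes; only the role and scope labels come from the doc.
type Role = "node" | "operator";
type Scope = "operator.read" | "operator.write" | "operator.admin";

interface Connection {
  role: Role;
  scopes: Scope[];
}

// Assumed mapping from control-plane method to the scope it needs.
const requiredScope: Record<string, Scope> = {
  "status.get": "operator.read",
  "agent.run": "operator.write",
  "config.set": "operator.admin",
};

// Events a node connection may send, as listed under "Role behaviors".
const nodeEvents = new Set(["voice.transcript", "agent.request", "chat.subscribe"]);

function isAllowed(conn: Connection, method: string): boolean {
  // Node connections never reach the control plane, whatever scopes they claim.
  if (conn.role === "node") return nodeEvents.has(method);
  const scope = requiredScope[method];
  // Unknown methods are denied rather than allowed by default.
  return scope !== undefined && conn.scopes.includes(scope);
}
```

Because the check takes a connection rather than a device, a device holding both roles is gated separately on each socket.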
+
+---
+
+# Unified authentication + pairing
+
+## Client identity
+Every client provides:
+- `deviceId` (stable, derived from device key).
+- `displayName` (human name).
+- `role` + `scope` + `caps` + `commands`.
+
+## Pairing flow (unified)
+- Client connects unauthenticated.
+- Gateway creates a **pairing request** for that `deviceId`.
+- Operator receives prompt; approves/denies.
+- Gateway issues credentials bound to:
+  - device public key
+  - role(s)
+  - scope(s)
+  - capabilities/commands
+- Client persists token, reconnects authenticated.
+
+## Device‑bound auth (avoid bearer token replay)
+Preferred: device keypairs.
+- Device generates keypair once.
+- `deviceId = fingerprint(publicKey)`.
+- Gateway sends nonce; device signs; gateway verifies.
+- Tokens are bound to the device's public key (proof‑of‑possession), so a copied token string alone is not enough to authenticate.
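The keypair steps above could look roughly like this with Node's built-in `crypto` module. Ed25519 as the key type and SHA-256 over the SPKI-encoded public key as the fingerprint are assumptions; the doc does not pick an algorithm. Both sides are shown in one file purely for illustration.

```ts
import { createHash, generateKeyPairSync, randomBytes, sign, verify } from "node:crypto";

// Device side: generate the keypair once and derive a stable id from the public key.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const publicKeyDer = publicKey.export({ type: "spki", format: "der" });
const deviceId = createHash("sha256").update(publicKeyDer).digest("hex");

// Gateway side: issue a fresh nonce for each connection attempt.
const nonce = randomBytes(32);

// Device side: sign the nonce. Ed25519 has no separate digest step, hence null.
const signature = sign(null, nonce, privateKey);

// Gateway side: verify against the public key recorded at pairing time.
const proven = verify(null, nonce, publicKey, signature);
```

A signature over one nonce does not verify against another, which is what makes a captured exchange useless for replay.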
+
+Alternatives:
+- mTLS (client certs): strongest, more ops complexity.
+- Short‑lived bearer tokens only as a temporary phase (rotate + revoke early).
+
+## Silent approval (SSH heuristic)
+Define it precisely to avoid a weak link. Prefer one:
+- **Local‑only**: auto‑pair when client connects via loopback/Unix socket.
+- **Challenge via SSH**: gateway issues a nonce; client proves it has SSH access to the gateway host by fetching that nonce over SSH.
+- **Physical presence window**: after a local approval on gateway host UI, allow auto‑pair for a short window (e.g. 10 minutes).
+
+Always log + record auto‑approvals.
+
+---
+
+# TLS everywhere (dev + prod)
+
+## Reuse existing bridge TLS
+Use current TLS runtime + fingerprint pinning:
+- `src/infra/bridge/server/tls.ts`
+- fingerprint verification logic in `src/node-host/bridge-client.ts`
+
+## Apply to WS
+- WS server supports TLS with same cert/key + fingerprint.
+- WS clients can pin fingerprint (optional).
+- Discovery advertises TLS + fingerprint for all endpoints.
+  - Discovery is locator hints only; never a trust anchor.
+
+## Why
+- Reduce reliance on SSH/Tailscale for confidentiality.
+- Make remote mobile connections safe by default.
+
+---
+
+# Approvals redesign (centralized)
+
+## Current
+Approval happens on node host (mac app node runtime). Prompt appears where node runs.
+
+## Proposed
+Approval is **gateway‑hosted**, UI delivered to operator clients.
+
+### New flow
+1) Gateway receives `system.run` intent (agent).
+2) Gateway creates approval record: `approval.requested`.
+3) Operator UI(s) show prompt.
+4) Approval decision sent to gateway: `approval.resolve`.
+5) Gateway invokes node command if approved.
+6) Node executes, returns `invoke-res`.
+
+### Approval semantics (hardening)
+- Broadcast to all operators; only the active UI shows a modal (others get a toast).
+- First resolution wins; gateway rejects subsequent resolves as already settled.
+- Default timeout: deny after N seconds (e.g. 60s), log reason.
+- Resolution requires `operator.approvals` scope.
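A sketch of the hardening rules above as an in-memory store. `ApprovalStore`, its method names, and the string result codes are hypothetical; time is passed in explicitly so the timeout rule can be exercised without timers.

```ts
type Decision = "approved" | "denied";
type ResolveResult = "ok" | "forbidden" | "unknown" | "already-settled";

interface Approval {
  id: string;
  createdAt: number; // ms
  decision?: Decision;
  reason?: string;
}

class ApprovalStore {
  private records = new Map<string, Approval>();
  private timeoutMs: number;

  constructor(timeoutMs = 60000) {
    this.timeoutMs = timeoutMs;
  }

  request(id: string, now: number): void {
    this.records.set(id, { id, createdAt: now });
  }

  // First resolution wins; later resolves are rejected as already settled.
  resolve(id: string, decision: Decision, scopes: string[]): ResolveResult {
    if (!scopes.includes("operator.approvals")) return "forbidden";
    const rec = this.records.get(id);
    if (!rec) return "unknown";
    if (rec.decision) return "already-settled";
    rec.decision = decision;
    return "ok";
  }

  // Deny anything still pending past the timeout and record the reason.
  expire(now: number): void {
    this.records.forEach((rec) => {
      if (!rec.decision && now - rec.createdAt >= this.timeoutMs) {
        rec.decision = "denied";
        rec.reason = "timeout";
      }
    });
  }

  get(id: string): Approval | undefined {
    return this.records.get(id);
  }
}
```

Keeping the settle check and the write in one synchronous method is what gives "first response wins" on a single-threaded gateway; a multi-process gateway would need a real atomic compare-and-set.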
+
+## Benefits
+- Prompt appears where user is (mac/phone).
+- Consistent approvals for remote nodes.
+- Node runtime stays headless; no UI dependency.
+
+---
+
+# Role clarity examples
+
+## iPhone app
+- **Node role** for: mic, camera, voice chat, location, push‑to‑talk.
+- Optional **operator.read** for status and chat view.
+- Optional **operator.write/admin** only when explicitly enabled.
+
+## macOS app
+- Operator role by default (control UI).
+- Node role when “Mac node” enabled (system.run, screen, camera).
+- Same deviceId for both connections → merged UI entry.
+
+## CLI
+- Operator role always.
+- Scope derived by subcommand:
+  - `status`, `logs` → read
+  - `agent`, `message` → write
+  - `config`, `channels` → admin
+  - approvals + pairing → `operator.approvals` / `operator.pairing`
+
+---
+
+# Identity + slugs
+
+## Stable ID
+Required for auth; never changes.
+Preferred:
+- Keypair fingerprint (public key hash).
+
+## Cute slug (lobster‑themed)
+Human label only.
+- Example: `scarlet-claw`, `saltwave`, `mantis-pinch`.
+- Stored in gateway registry, editable.
+- Collision handling: `-2`, `-3`.
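The `-2`, `-3` collision rule can be sketched as below. `allocateSlug` is a hypothetical helper, and the set of taken slugs is assumed to come from the gateway registry.

```ts
// Returns the base slug if free, otherwise the first free "<base>-N" with N >= 2.
function allocateSlug(base: string, taken: Set<string>): string {
  if (!taken.has(base)) return base;
  for (let n = 2; ; n++) {
    const candidate = `${base}-${n}`;
    if (!taken.has(candidate)) return candidate;
  }
}
```

Since slugs are labels only, a rename or collision suffix never affects auth, which stays keyed on `deviceId`.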
+
+## UI grouping
+Same `deviceId` across roles → single “Instance” row:
+- Badge: `operator`, `node`.
+- Shows capabilities + last seen.
+
+---
+
+# Migration strategy
+
+## Phase 0: Document + align
+- Publish this doc.
+- Inventory all protocol calls + approval flows.
+
+## Phase 1: Add roles/scopes to WS
+- Extend `connect` params with `role`, `scope`, `deviceId`.
+- Add allowlist gating for node role.
+
+## Phase 2: Bridge compatibility
+- Keep bridge running.
+- Add WS node support in parallel.
+- Gate features behind config flag.
+
+## Phase 3: Central approvals
+- Add approval request + resolve events in WS.
+- Update mac app UI to prompt + respond.
+- Node runtime stops prompting UI.
+
+## Phase 4: TLS unification
+- Add TLS config for WS using bridge TLS runtime.
+- Add pinning to clients.
+
+## Phase 5: Deprecate bridge
+- Migrate iOS/Android/mac node to WS.
+- Keep bridge as fallback; remove once stable.
+
+## Phase 6: Device‑bound auth
+- Require key‑based identity for all non‑local connections.
+- Add revocation + rotation UI.
+
+---
+
+# Security notes
+
+- Role/allowlist enforced at gateway boundary.
+- No client gets “full” API without operator scope.
+- Pairing required for *all* connections.
+- TLS + pinning reduces MITM risk for mobile.
+- SSH silent approval is a convenience; still recorded + revocable.
+- Discovery is never a trust anchor.
+- Capability claims are verified against server allowlists by platform/type.
+
+# Streaming + large payloads (node media)
+WS control plane is fine for small messages, but nodes also do:
+- camera clips
+- screen recordings
+- audio streams
+
+Options:
+1) WS binary frames + chunking + backpressure rules.
+2) Separate streaming endpoint (still TLS + auth).
+3) Keep bridge longer for media‑heavy commands, migrate last.
+
+Pick one before implementation to avoid drift.
+
+# Capability + command policy
+- Node‑reported caps/commands are treated as **claims**.
+- Gateway enforces per‑platform allowlists.
+- Any new command requires operator approval or explicit allowlist change.
+- Audit changes with timestamps.
+
+# Audit + rate limiting
+- Log: pairing requests, approvals/denials, token issuance/rotation/revocation.
+- Rate‑limit pairing spam and approval prompts.
+
+# Protocol hygiene
+- Explicit protocol version + error codes.
+- Reconnect rules + heartbeat policy.
+- Presence TTL and last‑seen semantics.
+
+---
+
+# Open questions
+
+1) Single device running both roles: token model
+   - Recommend separate tokens per role (node vs operator).
+   - Same deviceId; different scopes; clearer revocation.
+
+2) Operator scope granularity
+   - read/write/admin + approvals + pairing (minimum viable).
+   - Consider per‑feature scopes later.
+
+3) Token rotation + revocation UX
+   - Auto‑rotate on role change.
+   - UI to revoke by deviceId + role.
+
+4) Discovery
+   - Extend current Bonjour TXT to include WS TLS fingerprint + role hints.
+   - Treat as locator hints only.
+
+5) Cross‑network approval
+   - Broadcast to all operator clients; active UI shows modal.
+   - First response wins; gateway enforces atomicity.
+
+---
+
+# Summary (TL;DR)
+
+- Today: WS control plane + Bridge node transport.
+- Pain: approvals + duplication + two stacks.
+- Proposal: one WS protocol with explicit roles + scopes, unified pairing + TLS pinning, gateway‑hosted approvals, stable device IDs + cute slugs.
+- Outcome: simpler UX, stronger security, less duplication, better mobile routing.
--- a/docker-compose/ez-assistant/docs/refactor/exec-host.md
+++ b/docker-compose/ez-assistant/docs/refactor/exec-host.md
@@ -0,0 +1,265 @@
+---
+summary: "Refactor plan: exec host routing, node approvals, and headless runner"
+read_when:
+  - Designing exec host routing or exec approvals
+  - Implementing node runner + UI IPC
+  - Adding exec host security modes and slash commands
+---
+
+# Exec host refactor plan
+
+## Goals
+- Add `exec.host` + `exec.security` to route execution across **sandbox**, **gateway**, and **node**.
+- Keep defaults **safe**: no cross-host execution unless explicitly enabled.
+- Split execution into a **headless runner service** with optional UI (macOS app) via local IPC.
+- Provide **per-agent** policy, allowlist, ask mode, and node binding.
+- Support **ask modes** that work *with* or *without* allowlists.
+- Cross-platform: Unix socket + token auth (macOS/Linux/Windows parity).
+
+## Non-goals
+- No legacy allowlist migration or legacy schema support.
+- No PTY/streaming for node exec (aggregated output only).
+- No new network layer beyond the existing Bridge + Gateway.
+
+## Decisions (locked)
+- **Config keys:** `exec.host` + `exec.security` (per-agent override allowed).
+- **Elevation:** keep `/elevated` as an alias for gateway full access.
+- **Ask default:** `on-miss`.
+- **Approvals store:** `~/.clawdbot/exec-approvals.json` (JSON, no legacy migration).
+- **Runner:** headless system service; UI app hosts a Unix socket for approvals.
+- **Node identity:** use existing `nodeId`.
+- **Socket auth:** Unix socket + token (cross-platform); split later if needed.
+- **Node host state:** `~/.clawdbot/node.json` (node id + pairing token).
+- **macOS exec host:** run `system.run` inside the macOS app; node host service forwards requests over local IPC.
+- **No XPC helper:** stick to Unix socket + token + peer checks.
+
+## Key concepts
+### Host
+- `sandbox`: Docker exec (current behavior).
+- `gateway`: exec on gateway host.
+- `node`: exec on node runner via Bridge (`system.run`).
+
+### Security mode
+- `deny`: always block.
+- `allowlist`: allow only matches.
+- `full`: allow everything (equivalent to elevated).
+
+### Ask mode
+- `off`: never ask.
+- `on-miss`: ask only when allowlist does not match.
+- `always`: ask every time.
+
+Ask is **independent** of allowlist; allowlist can be used with `always` or `on-miss`.
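One possible reading of how the security and ask modes combine, written as a single decision function. The doc does not spell out every combination, so the table encoded here is an assumption: `deny` always wins, `always` prompts for anything not denied, and an allowlist miss with `ask=off` is blocked.

```ts
type Security = "deny" | "allowlist" | "full";
type Ask = "off" | "on-miss" | "always";
type Outcome = "deny" | "allow" | "ask";

function decide(security: Security, ask: Ask, allowlistHit: boolean): Outcome {
  if (security === "deny") return "deny";
  if (ask === "always") return "ask";
  if (security === "full" || allowlistHit) return "allow";
  // security is "allowlist" and the command missed the allowlist.
  return ask === "on-miss" ? "ask" : "deny";
}
```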
+
+### Policy resolution (per exec)
+1) Resolve `exec.host` (tool param → agent override → global default).
+2) Resolve `exec.security` and `exec.ask` (same precedence).
+3) If host is `sandbox`, proceed with local sandbox exec.
+4) If host is `gateway` or `node`, apply security + ask policy on that host.
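Steps 1 and 2 can be sketched as a field-by-field fallback. `ExecSettings` and `resolveExec` are hypothetical names; the final fallbacks mirror the values listed under "Default safety".

```ts
type Host = "sandbox" | "gateway" | "node";
type Security = "deny" | "allowlist" | "full";
type Ask = "off" | "on-miss" | "always";

interface ExecSettings {
  host?: Host;
  security?: Security;
  ask?: Ask;
}

// Precedence per field: tool param, then agent override, then global config, then built-in default.
function resolveExec(
  toolParam: ExecSettings,
  agent: ExecSettings,
  globalCfg: ExecSettings,
): Required<ExecSettings> {
  return {
    host: toolParam.host ?? agent.host ?? globalCfg.host ?? "sandbox",
    security: toolParam.security ?? agent.security ?? globalCfg.security ?? "deny",
    ask: toolParam.ask ?? agent.ask ?? globalCfg.ask ?? "on-miss",
  };
}
```

Each field resolves on its own, so a tool call that sets only `host` still inherits `security` and `ask` from the agent or global level.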
+
+## Default safety
+- Default `exec.host = sandbox`.
+- Default `exec.security = deny` for `gateway` and `node`.
+- Default `exec.ask = on-miss` (only relevant if security allows).
+- If no node binding is set, the **agent may target any node**, but only if policy allows it.
+
+## Config surface
+### Tool parameters
+- `exec.host` (optional): `sandbox | gateway | node`.
+- `exec.security` (optional): `deny | allowlist | full`.
+- `exec.ask` (optional): `off | on-miss | always`.
+- `exec.node` (optional): node id/name to use when `host=node`.
+
+### Config keys (global)
+- `tools.exec.host`
+- `tools.exec.security`
+- `tools.exec.ask`
+- `tools.exec.node` (default node binding)
+
+### Config keys (per agent)
+- `agents.list[].tools.exec.host`
+- `agents.list[].tools.exec.security`
+- `agents.list[].tools.exec.ask`
+- `agents.list[].tools.exec.node`
+
+### Alias
+- `/elevated on` = set `tools.exec.host=gateway`, `tools.exec.security=full` for the agent session.
+- `/elevated off` = restore previous exec settings for the agent session.
+
+## Approvals store (JSON)
+Path: `~/.clawdbot/exec-approvals.json`
+
+Purpose:
+- Local policy + allowlists for the **execution host** (gateway or node runner).
+- Ask fallback when no UI is available.
+- IPC credentials for UI clients.
+
+Proposed schema (v1):
+```json
+{
+  "version": 1,
+  "socket": {
+    "path": "~/.clawdbot/exec-approvals.sock",
+    "token": "base64-opaque-token"
+  },
+  "defaults": {
+    "security": "deny",
+    "ask": "on-miss",
+    "askFallback": "deny"
+  },
+  "agents": {
+    "agent-id-1": {
+      "security": "allowlist",
+      "ask": "on-miss",
+      "allowlist": [
+        {
+          "pattern": "~/Projects/**/bin/rg",
+          "lastUsedAt": 0,
+          "lastUsedCommand": "rg -n TODO",
+          "lastResolvedPath": "/Users/user/Projects/.../bin/rg"
+        }
+      ]
+    }
+  }
+}
+```
+Notes:
+- No legacy allowlist formats.
+- `askFallback` applies only when `ask` is required and no UI is reachable.
+- File permissions: `0600`.
+
+## Runner service (headless)
+### Role
+- Enforce `exec.security` + `exec.ask` locally.
+- Execute system commands and return output.
+- Emit Bridge events for exec lifecycle (optional but recommended).
+
+### Service lifecycle
+- Launchd/daemon on macOS; system service on Linux/Windows.
+- Approvals JSON is local to the execution host.
+- UI hosts a local Unix socket; runners connect on demand.
+
+## UI integration (macOS app)
+### IPC
+- Unix socket at `~/.clawdbot/exec-approvals.sock` (0600).
+- Token stored in `exec-approvals.json` (0600).
+- Peer checks: same-UID only.
+- Challenge/response: nonce + HMAC(token, request-hash) to prevent replay.
+- Short TTL (e.g., 10s) + max payload + rate limit.
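A sketch of the challenge/response check using Node's `crypto` module. The exact message layout (`nonce:requestHash`), SHA-256 as the hash, and the `Verifier` class are assumptions; the doc specifies only nonce, HMAC keyed by the token, a request hash, and a short TTL. Here the socket host issues the nonce and the connecting runner signs.

```ts
import { createHash, createHmac, randomBytes, timingSafeEqual } from "node:crypto";

const TTL_MS = 10000;

// Binding the nonce into the MAC input is what makes a captured request non-replayable.
function signRequest(token: string, nonce: string, body: string): string {
  const requestHash = createHash("sha256").update(body).digest("hex");
  return createHmac("sha256", token).update(`${nonce}:${requestHash}`).digest("hex");
}

class Verifier {
  private pending = new Map<string, number>(); // nonce -> issuedAt (ms)
  private token: string;

  constructor(token: string) {
    this.token = token;
  }

  issueNonce(now: number): string {
    const nonce = randomBytes(16).toString("hex");
    this.pending.set(nonce, now);
    return nonce;
  }

  verify(nonce: string, body: string, mac: string, now: number): boolean {
    const issuedAt = this.pending.get(nonce);
    if (issuedAt === undefined) return false; // unknown or already used
    this.pending.delete(nonce); // single use
    if (now - issuedAt > TTL_MS) return false;
    const expected = Buffer.from(signRequest(this.token, nonce, body), "hex");
    const given = Buffer.from(mac, "hex");
    return given.length === expected.length && timingSafeEqual(given, expected);
  }
}
```

The same-UID peer check, payload cap, and rate limit from the list above would sit in front of this and are not shown.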
+
+### Ask flow (macOS app exec host)
+1) Node service receives `system.run` from gateway.
+2) Node service connects to the local socket and sends the prompt/exec request.
+3) App validates peer + token + HMAC + TTL, then shows dialog if needed.
+4) App executes the command in UI context and returns output.
+5) Node service returns output to gateway.
+
+If UI missing:
+- Apply `askFallback` (`deny|allowlist|full`).
+
+### Diagram (ASCII)
+```
+Agent -> Gateway -> Bridge -> Node Service (TS)
+                         |  IPC (UDS + token + HMAC + TTL)
+                         v
+                     Mac App (UI + TCC + system.run)
+```
+
+## Node identity + binding
+- Use existing `nodeId` from Bridge pairing.
+- Binding model:
+  - `tools.exec.node` restricts the agent to a specific node.
+  - If unset, agent can pick any node (policy still enforces defaults).
+- Node selection resolution:
+  - `nodeId` exact match
+  - `displayName` (normalized)
+  - `remoteIp`
+  - `nodeId` prefix (>= 6 chars)
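The resolution order above, sketched as an ordered list of matchers. `NodeInfo`, the trim-and-lowercase normalization, and returning `undefined` when a step matches more than one node are assumptions; the caller would turn `undefined` into an error.

```ts
interface NodeInfo {
  nodeId: string;
  displayName: string;
  remoteIp?: string;
}

// Assumed normalization for display names.
const normalize = (s: string): string => s.trim().toLowerCase();

function selectNode(query: string, nodes: NodeInfo[]): NodeInfo | undefined {
  const q = normalize(query);
  // Tried in order; the first step that yields any match decides the result.
  const steps: Array<(n: NodeInfo) => boolean> = [
    (n) => n.nodeId === query,
    (n) => normalize(n.displayName) === q,
    (n) => n.remoteIp === query,
    (n) => query.length >= 6 && n.nodeId.startsWith(query),
  ];
  for (const matches of steps) {
    const hits = nodes.filter(matches);
    if (hits.length === 1) return hits[0];
    if (hits.length > 1) return undefined; // ambiguous
  }
  return undefined;
}
```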
+
+## Eventing
+### Who sees events
+- System events are **per session** and shown to the agent on the next prompt.
+- Stored in the gateway in-memory queue (`enqueueSystemEvent`).
+
+### Event text
+- `Exec started (node=<id>, id=<runId>)`
+- `Exec finished (node=<id>, id=<runId>, code=<code>)` + optional output tail
+- `Exec denied (node=<id>, id=<runId>, <reason>)`
+
+### Transport
+Option A (recommended):
+- Runner sends Bridge `event` frames `exec.started` / `exec.finished`.
+- Gateway `handleBridgeEvent` maps these into `enqueueSystemEvent`.
+
+Option B:
+- Gateway `exec` tool handles lifecycle directly (synchronous only).
+
+## Exec flows
+### Sandbox host
+- Existing `exec` behavior (Docker or host when unsandboxed).
+- PTY supported in non-sandbox mode only.
+
+### Gateway host
+- Gateway process executes on its own machine.
+- Enforces local `exec-approvals.json` (security/ask/allowlist).
+
+### Node host
+- Gateway calls `node.invoke` with `system.run`.
+- Runner enforces local approvals.
+- Runner returns aggregated stdout/stderr.
+- Optional Bridge events for start/finish/deny.
+
+## Output caps
+- Cap combined stdout+stderr at **200k**; keep **tail 20k** for events.
+- Truncate with a clear suffix (e.g., `"… (truncated)"`).
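A sketch of the two caps. The doc gives "200k" and "20k" without a unit, so counting UTF-16 characters is an assumption, as is keeping the head of the output when the main cap is hit. Both helper names are hypothetical.

```ts
const MAX_OUTPUT = 200000; // "200k" in the doc; unit assumed to be characters
const EVENT_TAIL = 20000; // "tail 20k" in the doc; same assumption
const SUFFIX = "… (truncated)";

// Caps the aggregated stdout+stderr returned to the caller (head kept: an assumption).
function capOutput(text: string, max: number = MAX_OUTPUT): string {
  return text.length <= max ? text : text.slice(0, max) + SUFFIX;
}

// Keeps only the last `max` characters for the system event text.
function tailForEvent(text: string, max: number = EVENT_TAIL): string {
  return text.length <= max ? text : text.slice(text.length - max);
}
```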
+
+## Slash commands
+- `/exec host=<sandbox|gateway|node> security=<deny|allowlist|full> ask=<off|on-miss|always> node=<id>`
+- Per-agent, per-session overrides; non-persistent unless saved via config.
+- `/elevated on|off|ask|full` remains a shortcut for `host=gateway security=full` (with `full` skipping approvals).
+
+## Cross-platform story
+- The runner service is the portable execution target.
+- UI is optional; if missing, `askFallback` applies.
+- Windows/Linux support the same approvals JSON + socket protocol.
+
+## Implementation phases
+### Phase 1: config + exec routing
+- Add config schema for `exec.host`, `exec.security`, `exec.ask`, `exec.node`.
+- Update tool plumbing to respect `exec.host`.
+- Add `/exec` slash command and keep `/elevated` alias.
+
+### Phase 2: approvals store + gateway enforcement
+- Implement `exec-approvals.json` reader/writer.
+- Enforce allowlist + ask modes for `gateway` host.
+- Add output caps.
+
+### Phase 3: node runner enforcement
+- Update node runner to enforce allowlist + ask.
+- Add Unix socket prompt bridge to macOS app UI.
+- Wire `askFallback`.
+
+### Phase 4: events
+- Add node → gateway Bridge events for exec lifecycle.
+- Map to `enqueueSystemEvent` for agent prompts.
+
+### Phase 5: UI polish
+- Mac app: allowlist editor, per-agent switcher, ask policy UI.
+- Node binding controls (optional).
+
+## Testing plan
+- Unit tests: allowlist matching (glob + case-insensitive).
+- Unit tests: policy resolution precedence (tool param → agent override → global).
+- Integration tests: node runner deny/allow/ask flows.
+- Bridge event tests: node event → system event routing.
+
+## Open risks
+- UI unavailability: ensure `askFallback` is respected.
+- Long-running commands: rely on timeout + output caps.
+- Multi-node ambiguity: error unless node binding or explicit node param.
+
+## Related docs
+- [Exec tool](/tools/exec)
+- [Exec approvals](/tools/exec-approvals)
+- [Nodes](/nodes)
+- [Elevated mode](/tools/elevated)
--- a/docker-compose/ez-assistant/docs/refactor/outbound-session-mirroring.md
+++ b/docker-compose/ez-assistant/docs/refactor/outbound-session-mirroring.md
@@ -0,0 +1,75 @@
+---
+title: Outbound Session Mirroring Refactor (Issue #1520)
+description: Track outbound session mirroring refactor notes, decisions, tests, and open items.
+---
+
+# Outbound Session Mirroring Refactor (Issue #1520)
+
+## Status
+- In progress.
+- Core + plugin channel routing updated for outbound mirroring.
+- Gateway send now derives target session when sessionKey is omitted.
+
+## Context
+Outbound sends were mirrored into the *current* agent session (tool session key) rather than the target channel session. Inbound routing uses channel/peer session keys, so outbound responses landed in the wrong session and first-contact targets often lacked session entries.
+
+## Goals
+- Mirror outbound messages into the target channel session key.
+- Create session entries on outbound when missing.
+- Keep thread/topic scoping aligned with inbound session keys.
+- Cover core channels plus bundled extensions.
+
+## Implementation Summary
+- New outbound session routing helper:
+  - `src/infra/outbound/outbound-session.ts`
+  - `resolveOutboundSessionRoute` builds target sessionKey using `buildAgentSessionKey` (dmScope + identityLinks).
+  - `ensureOutboundSessionEntry` writes minimal `MsgContext` via `recordSessionMetaFromInbound`.
+- `runMessageAction` (send) derives target sessionKey and passes it to `executeSendAction` for mirroring.
+- `message-tool` no longer mirrors directly; it only resolves agentId from the current session key.
+- Plugin send path mirrors via `appendAssistantMessageToSessionTranscript` using the derived sessionKey.
+- Gateway send derives a target session key when none is provided (default agent), and ensures a session entry.
+
+## Thread/Topic Handling
+- Slack: replyTo/threadId -> `resolveThreadSessionKeys` (suffix).
+- Discord: threadId/replyTo -> `resolveThreadSessionKeys` with `useSuffix=false` to match inbound (thread channel id already scopes session).
+- Telegram: topic IDs map to `chatId:topic:<id>` via `buildTelegramGroupPeerId`.
+
+## Extensions Covered
+- Matrix, MS Teams, Mattermost, BlueBubbles, Nextcloud Talk, Zalo, Zalo Personal, Nostr, Tlon.
+- Notes:
+  - Mattermost targets now strip `@` for DM session key routing.
+  - Zalo Personal uses DM peer kind for 1:1 targets (group only when `group:` is present).
+  - BlueBubbles group targets strip `chat_*` prefixes to match inbound session keys.
+  - Slack auto-thread mirroring matches channel ids case-insensitively.
+  - Gateway send lowercases provided session keys before mirroring.
+
+## Decisions
+- **Gateway send session derivation**: if `sessionKey` is provided, use it. If omitted, derive a sessionKey from target + default agent and mirror there.
+- **Session entry creation**: always use `recordSessionMetaFromInbound` with `Provider/From/To/ChatType/AccountId/Originating*` aligned to inbound formats.
+- **Target normalization**: outbound routing uses resolved targets (post `resolveChannelTarget`) when available.
+- **Session key casing**: canonicalize session keys to lowercase on write and during migrations.
+
+## Tests Added/Updated
+- `src/infra/outbound/outbound-session.test.ts`
+  - Slack thread session key.
+  - Telegram topic session key.
+  - dmScope identityLinks with Discord.
+- `src/agents/tools/message-tool.test.ts`
+  - Derives agentId from session key (no sessionKey passed through).
+- `src/gateway/server-methods/send.test.ts`
+  - Derives session key when omitted and creates session entry.
+
+## Open Items / Follow-ups
+- Voice-call plugin uses custom `voice:<phone>` session keys. Outbound mapping is not standardized here; if message-tool should support voice-call sends, add explicit mapping.
+- Confirm whether any external plugin uses non-standard `From/To` formats beyond the bundled set.
+
+## Files Touched
+- `src/infra/outbound/outbound-session.ts`
+- `src/infra/outbound/outbound-send-service.ts`
+- `src/infra/outbound/message-action-runner.ts`
+- `src/agents/tools/message-tool.ts`
+- `src/gateway/server-methods/send.ts`
+- Tests in:
+  - `src/infra/outbound/outbound-session.test.ts`
+  - `src/agents/tools/message-tool.test.ts`
+  - `src/gateway/server-methods/send.test.ts`
--- a/docker-compose/ez-assistant/docs/refactor/plugin-sdk.md
+++ b/docker-compose/ez-assistant/docs/refactor/plugin-sdk.md
@@ -0,0 +1,187 @@
+---
+summary: "Plan: one clean plugin SDK + runtime for all messaging connectors"
+read_when:
+  - Defining or refactoring the plugin architecture
+  - Migrating channel connectors to the plugin SDK/runtime
+---
+# Plugin SDK + Runtime Refactor Plan
+
+Goal: every messaging connector is a plugin (bundled or external) using one stable API.
+No plugin imports from `src/**` directly. All dependencies go through the SDK or runtime.
+
+## Why now
+- Current connectors mix patterns: direct core imports, dist-only bridges, and custom helpers.
+- This makes upgrades brittle and blocks a clean external plugin surface.
+
+## Target architecture (two layers)
+
+### 1) Plugin SDK (compile-time, stable, publishable)
+Scope: types, helpers, and config utilities. No runtime state, no side effects.
+
+Contents (examples):
+- Types: `ChannelPlugin`, adapters, `ChannelMeta`, `ChannelCapabilities`, `ChannelDirectoryEntry`.
+- Config helpers: `buildChannelConfigSchema`, `setAccountEnabledInConfigSection`, `deleteAccountFromConfigSection`,
+  `applyAccountNameToChannelSection`.
+- Pairing helpers: `PAIRING_APPROVED_MESSAGE`, `formatPairingApproveHint`.
+- Onboarding helpers: `promptChannelAccessConfig`, `addWildcardAllowFrom`, onboarding types.
+- Tool param helpers: `createActionGate`, `readStringParam`, `readNumberParam`, `readReactionParams`, `jsonResult`.
+- Docs link helper: `formatDocsLink`.
+
+Delivery:
+- Publish as `@clawdbot/plugin-sdk` (or export from core under `clawdbot/plugin-sdk`).
+- Semver with explicit stability guarantees.
+
+### 2) Plugin Runtime (execution surface, injected)
+Scope: everything that touches core runtime behavior.
+Accessed via `MoltbotPluginApi.runtime` so plugins never import `src/**`.
+
+Proposed surface (minimal but complete):
+```ts
+export type PluginRuntime = {
+  channel: {
+    text: {
+      chunkMarkdownText(text: string, limit: number): string[];
+      resolveTextChunkLimit(cfg: MoltbotConfig, channel: string, accountId?: string): number;
+      hasControlCommand(text: string, cfg: MoltbotConfig): boolean;
+    };
+    reply: {
+      dispatchReplyWithBufferedBlockDispatcher(params: {
+        ctx: unknown;
+        cfg: unknown;
+        dispatcherOptions: {
+          deliver: (payload: { text?: string; mediaUrls?: string[]; mediaUrl?: string }) =>
+            void | Promise<void>;
+          onError?: (err: unknown, info: { kind: string }) => void;
+        };
+      }): Promise<void>;
+      createReplyDispatcherWithTyping?: unknown; // adapter for Teams-style flows
+    };
+    routing: {
+      resolveAgentRoute(params: {
+        cfg: unknown;
+        channel: string;
+        accountId: string;
+        peer: { kind: "dm" | "group" | "channel"; id: string };
+      }): { sessionKey: string; accountId: string };
+    };
+    pairing: {
+      buildPairingReply(params: { channel: string; idLine: string; code: string }): string;
+      readAllowFromStore(channel: string): Promise<string[]>;
+      upsertPairingRequest(params: {
+        channel: string;
+        id: string;
+        meta?: { name?: string };
+      }): Promise<{ code: string; created: boolean }>;
+    };
+    media: {
+      fetchRemoteMedia(params: { url: string }): Promise<{ buffer: Buffer; contentType?: string }>;
+      saveMediaBuffer(
+        buffer: Uint8Array,
+        contentType: string | undefined,
+        direction: "inbound" | "outbound",
+        maxBytes: number,
+      ): Promise<{ path: string; contentType?: string }>;
+    };
+    mentions: {
+      buildMentionRegexes(cfg: MoltbotConfig, agentId?: string): RegExp[];
+      matchesMentionPatterns(text: string, regexes: RegExp[]): boolean;
+    };
+    groups: {
+      resolveGroupPolicy(cfg: MoltbotConfig, channel: string, accountId: string, groupId: string): {
+        allowlistEnabled: boolean;
+        allowed: boolean;
+        groupConfig?: unknown;
+        defaultConfig?: unknown;
+      };
+      resolveRequireMention(
+        cfg: MoltbotConfig,
+        channel: string,
+        accountId: string,
+        groupId: string,
+        override?: boolean,
+      ): boolean;
+    };
+    debounce: {
+      createInboundDebouncer<T>(opts: {
+        debounceMs: number;
+        buildKey: (v: T) => string | null;
+        shouldDebounce: (v: T) => boolean;
+        onFlush: (entries: T[]) => Promise<void>;
+        onError?: (err: unknown) => void;
+      }): { push: (v: T) => void; flush: () => Promise<void> };
+      resolveInboundDebounceMs(cfg: MoltbotConfig, channel: string): number;
+    };
+    commands: {
+      resolveCommandAuthorizedFromAuthorizers(params: {
+        useAccessGroups: boolean;
+        authorizers: Array<{ configured: boolean; allowed: boolean }>;
+      }): boolean;
+    };
+  };
+  logging: {
+    shouldLogVerbose(): boolean;
+    getChildLogger(name: string): PluginLogger;
+  };
+  state: {
+    resolveStateDir(cfg: MoltbotConfig): string;
+  };
+};
+```
+
+Notes:
+- Runtime is the only way to access core behavior.
+- SDK is intentionally small and stable.
+- Each runtime method maps to an existing core implementation (no duplication).
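To illustrate the "runtime is the only way to access core behavior" rule, here is a sketch of plugin-side code that depends only on an injected API object. `PluginApiSketch`, `sendChunked`, and `fakeApi` are hypothetical; only `chunkMarkdownText` and its signature come from the proposed surface, and the stand-in chunker just slices by length, unlike a real markdown-aware one.

```ts
// Only the slice of the runtime this sketch needs.
interface PluginApiSketch {
  runtime: {
    channel: {
      text: { chunkMarkdownText(text: string, limit: number): string[] };
    };
  };
}

// Plugin code: no import from src/**, everything arrives through `api`.
async function sendChunked(
  api: PluginApiSketch,
  text: string,
  limit: number,
  deliver: (chunk: string) => Promise<void>,
): Promise<number> {
  const chunks = api.runtime.channel.text.chunkMarkdownText(text, limit);
  for (const chunk of chunks) {
    await deliver(chunk);
  }
  return chunks.length;
}

// Stand-in runtime for tests; a real runtime would be injected by core.
const fakeApi: PluginApiSketch = {
  runtime: {
    channel: {
      text: {
        chunkMarkdownText(text: string, limit: number): string[] {
          const out: string[] = [];
          for (let i = 0; i < text.length; i += limit) out.push(text.slice(i, i + limit));
          return out;
        },
      },
    },
  },
};
```

The same injection point is what lets adapter-level tests swap in a fake runtime without touching core.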
+
+## Migration plan (phased, safe)
+
+### Phase 0: scaffolding
+- Introduce `@clawdbot/plugin-sdk`.
+- Add `api.runtime` to `MoltbotPluginApi` with the surface above.
+- Maintain existing imports during a transition window (deprecation warnings).
+
+### Phase 1: bridge cleanup (low risk)
+- Replace per-extension `core-bridge.ts` with `api.runtime`.
+- Migrate BlueBubbles, Zalo, Zalo Personal first (already close).
+- Remove duplicated bridge code.
+
+### Phase 2: light direct-import plugins
+- Migrate Matrix to SDK + runtime.
+- Validate onboarding, directory, group mention logic.
+
+### Phase 3: heavy direct-import plugins
+- Migrate MS Teams (largest set of runtime helpers).
+- Ensure reply/typing semantics match current behavior.
+
+### Phase 4: iMessage pluginization
+- Move iMessage into `extensions/imessage`.
+- Replace direct core calls with `api.runtime`.
+- Keep config keys, CLI behavior, and docs intact.
+
+### Phase 5: enforcement
+- Add a lint rule / CI check: files under `extensions/**` must not import from `src/**`.
+- Add plugin SDK/version compatibility checks (runtime + SDK semver).
+
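The import boundary in Phase 5 could be checked with a small script along these lines. This is a minimal sketch, not the project's actual lint rule: `findForbiddenImports`, `normalizePath`, and the regex-based import matching are assumptions made for illustration, and a real check would more likely be an ESLint rule or a TypeScript AST walk.

```typescript
// Hypothetical helper: collapses "." and ".." segments in a POSIX-style,
// repo-relative path (for example "extensions/a/../../src/x" -> "src/x").
function normalizePath(path: string): string {
  const out: string[] = [];
  for (const part of path.split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") out.pop();
    else out.push(part);
  }
  return out.join("/");
}

// Hypothetical helper: returns the relative import specifiers in an extension
// file that resolve into the core `src/` tree. Bare package imports are ignored.
function findForbiddenImports(filePath: string, source: string): string[] {
  if (!filePath.startsWith("extensions/")) return [];
  const dir = filePath.split("/").slice(0, -1).join("/");
  // Matches `from "x"`, `import("x")`, `import "x"`, and `require("x")`.
  const pattern = /(?:from\s+|import\s*\(\s*|import\s+|require\s*\(\s*)["']([^"']+)["']/g;
  const hits: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    const spec = match[1];
    if (!spec.startsWith(".")) continue;
    const resolved = normalizePath(dir + "/" + spec);
    if (resolved === "src" || resolved.startsWith("src/")) hits.push(spec);
  }
  return hits;
}
```

A CI job could run this over every file under `extensions/` and fail when any file returns a non-empty list. An extension's own `src/` directory is not flagged, because it resolves to a path under `extensions/`.
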
+## Compatibility and versioning
+- SDK: semver, published, documented changes.
+- Runtime: versioned per core release. Add `api.runtime.version`.
+- Plugins declare a required runtime range (e.g., `moltbotRuntime: ">=2026.2.0"`).
+
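The runtime range check could look roughly like the following. This is a sketch under stated assumptions: `satisfiesMinRuntime` and `parseVersion` are hypothetical names, only ranges of the form `>=MAJOR.MINOR.PATCH` are handled, and a real implementation would probably rely on a full semver library.

```typescript
// Hypothetical helper: parses "2026.2.0" into [2026, 2, 0].
function parseVersion(version: string): number[] {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(version.trim());
  if (!m) throw new Error("Unsupported version: " + version);
  return [Number(m[1]), Number(m[2]), Number(m[3])];
}

// Hypothetical helper: checks a runtime version (what `api.runtime.version`
// would report) against a plugin's declared ">=x.y.z" range.
function satisfiesMinRuntime(runtimeVersion: string, range: string): boolean {
  if (!range.startsWith(">=")) throw new Error("Unsupported range: " + range);
  const have = parseVersion(runtimeVersion);
  const want = parseVersion(range.slice(2));
  for (let i = 0; i < 3; i++) {
    // Compare numerically so that 2026.10.0 sorts after 2026.2.0.
    if (have[i] !== want[i]) return have[i] > want[i];
  }
  return true; // equal versions satisfy ">="
}
```

Comparing the parts as numbers rather than strings matters for calendar-style versions, where a two-digit month would otherwise sort before a one-digit one.
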
+## Testing strategy
+- Adapter-level unit tests (runtime functions exercised with real core implementation).
+- Golden tests per plugin: ensure no behavior drift (routing, pairing, allowlist, mention gating).
+- A single end-to-end plugin sample used in CI (install + run + smoke test).
+
+## Open questions
+- Where to host SDK types: separate package or core export?
+- Runtime type distribution: in SDK (types only) or in core?
+- How to expose docs links for bundled vs external plugins?
+- Do we allow limited direct core imports for in-repo plugins during transition?
+
+## Success criteria
+- All channel connectors are plugins using SDK + runtime.
+- No file under `extensions/**` imports from `src/**`.
+- New connector templates depend only on SDK + runtime.
+- External plugins can be developed and updated without core source access.
+
+Related docs: [Plugins](/plugin), [Channels](/channels/index), [Configuration](/gateway/configuration).
--- a/docker-compose/ez-assistant/docs/refactor/strict-config.md
+++ b/docker-compose/ez-assistant/docs/refactor/strict-config.md
@@ -0,0 +1,81 @@
+---
+summary: "Strict config validation + doctor-only migrations"
+read_when:
+  - Designing or implementing config validation behavior
+  - Working on config migrations or doctor workflows
+  - Handling plugin config schemas or plugin load gating
+---
+# Strict config validation (doctor-only migrations)
+
+## Goals
+- **Reject unknown config keys everywhere** (root + nested).
+- **Reject plugin config without a schema**; don’t load that plugin.
+- **Remove legacy auto-migration on load**; migrations run via doctor only.
+- **Auto-run doctor (dry-run) on startup**; if invalid, block non-diagnostic commands.
+
+## Non-goals
+- Backward compatibility on load (legacy keys do not auto-migrate).
+- Silent drops of unrecognized keys.
+
+## Strict validation rules
+- Config must match the schema exactly at every level.
+- Unknown keys are validation errors (no passthrough at root or nested).
+- `plugins.entries.<id>.config` must be validated by the plugin’s schema.
+  - If a plugin lacks a schema, **reject plugin load** and surface a clear error.
+- Unknown `channels.<id>` keys are errors unless a plugin manifest declares the channel id.
+- Plugin manifests (`moltbot.plugin.json`) are required for all plugins.
+
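To illustrate the "unknown keys are errors at every level" rule, here is a dependency-free sketch. The touchpoints listed later in this document suggest the real schemas are Zod strict objects; the `Shape` type and `findUnknownKeys` below are hypothetical stand-ins that show only the key-checking behavior, not value validation.

```typescript
// Hypothetical, simplified schema: a leaf is `true`; an object lists its allowed keys.
type Shape = true | { [key: string]: Shape };

// Returns the full dotted path of every key that the shape does not declare.
function findUnknownKeys(value: unknown, shape: Shape, path: string = ""): string[] {
  if (shape === true) return []; // leaf: value type checks are out of scope here
  if (typeof value !== "object" || value === null || Array.isArray(value)) return [];
  const record = value as Record<string, unknown>;
  const errors: string[] = [];
  for (const key of Object.keys(record)) {
    const childPath = path ? path + "." + key : key;
    if (!Object.prototype.hasOwnProperty.call(shape, key)) {
      errors.push(childPath); // unknown key: report it, never pass it through
    } else {
      errors.push(...findUnknownKeys(record[key], shape[key], childPath));
    }
  }
  return errors;
}
```

Because the function recurses with the child shape, a typo nested several levels deep is reported with its full path rather than silently dropped.
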
+## Plugin schema enforcement
+- Each plugin provides a strict JSON Schema for its config (inline in the manifest).
+- Plugin load flow:
+  1) Resolve plugin manifest + schema (`moltbot.plugin.json`).
+  2) Validate config against the schema.
+  3) If the schema is missing or the config is invalid: block plugin load, record error.
+- Error message includes:
+  - Plugin id
+  - Reason (missing schema / invalid config)
+  - Path(s) that failed validation
+- Disabled plugins keep their config, but Doctor + logs surface a warning.
+
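The plugin load flow above can be sketched as a single gating function. All names here (`PluginManifest`, `configSchema`, `Validator`, `gatePluginLoad`) are assumptions made for illustration; the manifest field that actually holds the schema is not specified in this document, and the validator is injected so that the sketch does not depend on a particular JSON Schema library.

```typescript
// Hypothetical manifest shape: only the id and an optional inline schema matter here.
interface PluginManifest {
  id: string;
  configSchema?: object; // field name is an assumption
}

// Injected validator: returns the config paths that failed validation.
type Validator = (schema: object, config: unknown) => string[];

type LoadResult =
  | { ok: true; pluginId: string }
  | { ok: false; pluginId: string; reason: "missing schema" | "invalid config"; paths: string[] };

function gatePluginLoad(manifest: PluginManifest, config: unknown, validate: Validator): LoadResult {
  // Steps 1 and 3: a plugin without a schema is never loaded.
  if (!manifest.configSchema) {
    return { ok: false, pluginId: manifest.id, reason: "missing schema", paths: [] };
  }
  // Steps 2 and 3: invalid config blocks the load and records the failing paths.
  const paths = validate(manifest.configSchema, config);
  if (paths.length > 0) {
    return { ok: false, pluginId: manifest.id, reason: "invalid config", paths };
  }
  return { ok: true, pluginId: manifest.id };
}
```

The failure variant carries exactly the three pieces of information the error message is meant to include: plugin id, reason, and failing paths.
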
+## Doctor flow
+- Doctor runs **every time** config is loaded (dry-run by default).
+- If the config is invalid:
+  - Print a summary + actionable errors.
+  - Instruct: `moltbot doctor --fix`.
+- `moltbot doctor --fix`:
+  - Applies migrations.
+  - Removes unknown keys.
+  - Writes updated config.
+
+## Command gating (when config is invalid)
+Allowed (diagnostic-only):
+- `moltbot doctor`
+- `moltbot logs`
+- `moltbot health`
+- `moltbot help`
+- `moltbot status`
+- `moltbot gateway status`
+
+Everything else must hard-fail with: “Config invalid. Run `moltbot doctor --fix`.”
+
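A minimal sketch of the gating above, assuming the CLI can see the arguments that follow the `moltbot` binary name. `checkCommandAllowed` and `DIAGNOSTIC_COMMANDS` are hypothetical names, and the flag handling is deliberately simplistic (anything starting with `-` is ignored).

```typescript
// Commands that stay available when the config is invalid (from the list above).
const DIAGNOSTIC_COMMANDS: string[][] = [
  ["doctor"],
  ["logs"],
  ["health"],
  ["help"],
  ["status"],
  ["gateway", "status"],
];

const INVALID_CONFIG_MESSAGE = "Config invalid. Run `moltbot doctor --fix`.";

// `args` is argv after the binary name, for example ["gateway", "status"].
function checkCommandAllowed(
  args: string[],
  configValid: boolean,
): { allowed: boolean; message?: string } {
  if (configValid) return { allowed: true };
  // Drop flags so that `doctor --fix` still matches the `doctor` entry.
  const positional = args.filter((arg) => !arg.startsWith("-"));
  const isDiagnostic = DIAGNOSTIC_COMMANDS.some((cmd) =>
    cmd.every((part, i) => positional[i] === part),
  );
  return isDiagnostic ? { allowed: true } : { allowed: false, message: INVALID_CONFIG_MESSAGE };
}
```

Matching on the full command path keeps `gateway status` available while other `gateway` subcommands still hard-fail.
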
+## Error UX format
+- Single summary header.
+- Grouped sections:
+  - Unknown keys (full paths)
+  - Legacy keys / migrations needed
+  - Plugin load failures (plugin id + reason + path)
+
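One way to render that layout is sketched below. The `ConfigReport` shape, the function names, and the exact wording of the header are assumptions; only the grouping (one summary header, then one section per problem type) comes from the list above.

```typescript
// Hypothetical report shape collected during validation.
interface ConfigReport {
  unknownKeys: string[]; // full paths
  legacyKeys: string[];
  pluginFailures: { pluginId: string; reason: string; path?: string }[];
}

// Renders one titled section, or nothing when the section has no items.
function section(title: string, items: string[]): string[] {
  if (items.length === 0) return [];
  return ["", title + ":", ...items.map((item) => "  - " + item)];
}

function formatConfigReport(report: ConfigReport): string {
  const failures = report.pluginFailures.map((f) =>
    f.path ? `${f.pluginId}: ${f.reason} (${f.path})` : `${f.pluginId}: ${f.reason}`,
  );
  const total = report.unknownKeys.length + report.legacyKeys.length + failures.length;
  return [
    `Config invalid: ${total} problem(s) found.`, // single summary header
    ...section("Unknown keys", report.unknownKeys),
    ...section("Legacy keys / migrations needed", report.legacyKeys),
    ...section("Plugin load failures", failures),
  ].join("\n");
}
```

Empty sections are omitted, so a config with only unknown keys prints the header and a single group.
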
+## Implementation touchpoints
+- `src/config/zod-schema.ts`: remove root passthrough; strict objects everywhere.
+- `src/config/zod-schema.providers.ts`: ensure strict channel schemas.
+- `src/config/validation.ts`: fail on unknown keys; do not apply legacy migrations.
+- `src/config/io.ts`: remove legacy auto-migrations; always run doctor dry-run.
+- `src/config/legacy*.ts`: move usage to doctor only.
+- `src/plugins/*`: add schema registry + gating.
+- CLI command gating in `src/cli`.
+
+## Tests
+- Unknown key rejection (root + nested).
+- Plugin missing schema → plugin load blocked with clear error.
+- Invalid config → gateway startup blocked; only diagnostic commands run.
+- Doctor dry-run runs automatically on config load; `doctor --fix` writes the corrected config.