ability-vision
Vision/screenshot ability
ability-vision is a Kadi “ability” that provides vision analysis and image-understanding capabilities backed by multimodal LLMs (for example Claude Sonnet / GPT-4V). It runs as a native ability that connects to the local broker and listens on the “vision” network. The package’s primary role is to expose vision-analysis functionality to other agents in the AGENTS ecosystem while delegating secrets and credentials to a separate secret ability / vault.
Architecture
- Runtime: Native ability process (MODE = “native”) that connects over WebSocket to the local broker.
- Broker connectivity:
  - Reads broker configuration from config.toml -> [broker.local].
  - Connects to the broker at broker.local.URL (default ws://localhost:8080/kadi).
  - Joins one or more networks specified by broker.local.NETWORKS; this ability uses the “vision” network.
- Model manager:
  - Model configuration is provided in the [model] section of config.toml.
  - The vision model is selected via VISION_MODEL and, if MANAGER_BASE_URL is set, managed through that model manager.
- Secrets:
  - Secrets (API keys to multimodal LLMs, vault tokens, etc.) are not stored in config.toml. They must be placed in the encrypted secrets.toml vault (or provided by the referenced secret-ability).
  - agent.json declares a dependency on “secret-ability” (see Dependencies).
- Data flow (typical):
  - The ability process starts and connects to the broker.
  - It registers itself on the “vision” network and awaits requests.
  - Another agent or client posts an image-analysis request to the broker network/topic. The ability receives it, retrieves the required secrets (via secret-ability or the vault), calls the configured multimodal model endpoint, and returns structured results via the broker.
- Integration points:
  - Entrypoint: dist/index.js. This module should create the Kadi ability instance and register tools/handlers.
  - Abilities list: ability-vision depends on secret-ability for credential management. Other agents that need vision capabilities will call across the “vision” network.
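The typical data flow above can be sketched as a single handler with injected dependencies. This is a minimal sketch, not the implementation in dist/index.js: the request and result shapes, the `getSecret` and `callModel` functions, and the secret key name `VISION_API_KEY` are all assumptions made for illustration.

```typescript
// Hypothetical shapes: the real request/response types live in dist/index.js.
interface VisionRequest {
  imageBase64: string;
  prompt: string;
}

interface VisionResult {
  model: string;
  text: string;
}

// Mirrors the two [model] fields the handler needs.
interface ModelSettings {
  VISION_MODEL: string;
  MAX_TOKENS: number;
}

// Injected stand-ins for secret-ability and the multimodal model endpoint.
interface VisionDeps {
  getSecret: (key: string) => Promise<string | undefined>;
  callModel: (args: {
    apiKey: string;
    model: string;
    maxTokens: number;
    request: VisionRequest;
  }) => Promise<string>;
}

// Builds a handler that follows the documented flow:
// receive request -> fetch secret -> call model -> return structured result.
function createVisionHandler(settings: ModelSettings, deps: VisionDeps) {
  return async (request: VisionRequest): Promise<VisionResult> => {
    if (request.imageBase64.length === 0) {
      throw new Error("imageBase64 must not be empty");
    }
    // "VISION_API_KEY" is an assumed key name, not one confirmed by the source.
    const apiKey = await deps.getSecret("VISION_API_KEY");
    if (apiKey === undefined) {
      throw new Error("missing secret: VISION_API_KEY");
    }
    const text = await deps.callModel({
      apiKey,
      model: settings.VISION_MODEL,
      maxTokens: settings.MAX_TOKENS,
      request,
    });
    return { model: settings.VISION_MODEL, text };
  };
}
```

Injecting `getSecret` and `callModel` keeps the flow testable without a broker, a vault, or network access; in a real ability they would presumably be backed by secret-ability and the configured model client.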
Tools / API
The provided source (agent.json + config.toml) does not include an explicit list of registered tools in code form. The single entrypoint is dist/index.js and the runtime is expected to register vision-related tools/handlers on the “vision” network.
Summary table of the package runtime surface (from available source metadata):
| Name / file | Purpose |
|---|---|
| dist/index.js | Ability entrypoint. Expected to export the runtime that registers vision handlers/tools. |
| config.toml -> broker.local.NETWORKS | Network(s) the ability joins (defaults to [“vision”]). |
| agent.json -> abilities.secret-ability | Declares dependency on the secret-ability (used to fetch encrypted secrets). |
Note: If your local implementation registers explicit tools (e.g., analyzeImage, describeScreenshot, ocrImage), they will be defined inside dist/index.js. Inspect that file to see precise tool names and parameters.
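To make the idea of "registered tools" concrete, here is a minimal sketch of a local tool registry. It is a hypothetical stand-in, not the @kadi.build/core API: the `ToolRegistry` class and its methods are invented for illustration, and `analyzeImage` is only one of the example names mentioned in the note above.

```typescript
// Hypothetical handler signature; real tools may be async and strongly typed.
type ToolHandler = (params: Record<string, unknown>) => string;

// A tiny name -> handler registry, illustrating what "registering a tool" means.
class ToolRegistry {
  private readonly tools = new Map<string, ToolHandler>();

  // Adds a tool, refusing duplicates so a name maps to exactly one handler.
  register(name: string, handler: ToolHandler): void {
    if (this.tools.has(name)) {
      throw new Error(`tool already registered: ${name}`);
    }
    this.tools.set(name, handler);
  }

  // Lists registered tool names in alphabetical order.
  names(): string[] {
    return Array.from(this.tools.keys()).sort();
  }

  // Invokes a tool by name, failing loudly on unknown names.
  call(name: string, params: Record<string, unknown>): string {
    const handler = this.tools.get(name);
    if (handler === undefined) {
      throw new Error(`unknown tool: ${name}`);
    }
    return handler(params);
  }
}
```

A caller would register a handler once at startup, for example `registry.register("analyzeImage", handler)`, and later invoke it by name; the actual names and parameters must still be taken from dist/index.js.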
Configuration
Configuration is read from config.toml. Secrets must be placed in the encrypted secrets.toml vault (or provided by the secret ability). Relevant fields:
- broker.local
  - URL (string): WebSocket URL for the broker.
    - Example: “ws://localhost:8080/kadi”
  - NETWORKS (array[string]): One or more networks this ability joins.
    - Example: [“vision”]
  - MODE (string): Execution mode. For ability-vision the default is “native”.
- model
  - MANAGER_BASE_URL (string): Optional base URL for a model manager/proxy (left blank by default).
  - VISION_MODEL (string): Which multimodal model to use. Default in config.toml: “claude-sonnet-4-20250514”.
  - MAX_TOKENS (number): Token limit (default 4096).
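The fields above map naturally onto a typed configuration object. The sketch below is an assumption about how one might model and sanity-check them in TypeScript; the interface names and the `validateConfig` helper are hypothetical, while the field names and default values are taken from config.toml.

```typescript
interface BrokerLocalConfig {
  URL: string;
  NETWORKS: string[];
  MODE: string;
}

interface ModelConfig {
  MANAGER_BASE_URL: string;
  VISION_MODEL: string;
  MAX_TOKENS: number;
}

interface VisionConfig {
  broker: { local: BrokerLocalConfig };
  model: ModelConfig;
}

// Defaults as documented for config.toml.
const DEFAULT_CONFIG: VisionConfig = {
  broker: {
    local: {
      URL: "ws://localhost:8080/kadi",
      NETWORKS: ["vision"],
      MODE: "native",
    },
  },
  model: {
    MANAGER_BASE_URL: "",
    VISION_MODEL: "claude-sonnet-4-20250514",
    MAX_TOKENS: 4096,
  },
};

// Returns a list of human-readable problems; an empty list means the config looks sane.
function validateConfig(config: VisionConfig): string[] {
  const problems: string[] = [];
  if (!/^wss?:\/\//.test(config.broker.local.URL)) {
    problems.push("broker.local.URL must start with ws:// or wss://");
  }
  if (config.broker.local.NETWORKS.length === 0) {
    problems.push("broker.local.NETWORKS must list at least one network");
  }
  if (config.model.VISION_MODEL.trim() === "") {
    problems.push("model.VISION_MODEL must not be empty");
  }
  if (!Number.isInteger(config.model.MAX_TOKENS) || config.model.MAX_TOKENS <= 0) {
    problems.push("model.MAX_TOKENS must be a positive integer");
  }
  return problems;
}
```

Collecting problems in a list rather than throwing on the first one lets a startup check report every misconfigured field at once.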
Secrets:
- secrets.toml (encrypted vault): All runtime secrets (API keys, vault tokens, etc.) must go into the secrets vault. This ability relies on the declared “secret-ability” to fetch these secrets at runtime.
agent.json metadata (highlights):
- name: ability-vision
- type: ability
- entrypoint: dist/index.js
- abilities: { “secret-ability”: “*” }, which declares a required ability for secrets.
Example config.toml (copied from source) in a TypeScript snippet for reference:

    export const CONFIG_TOML = `# Ability Vision Configuration
    # Secrets go in secrets.toml (encrypted vault)

    [broker.local]
    URL = "ws://localhost:8080/kadi"
    NETWORKS = ["vision"]
    MODE = "native"

    [model]
    MANAGER_BASE_URL = ""
    VISION_MODEL = "claude-sonnet-4-20250514"
    MAX_TOKENS = 4096`;

agent.json metadata in TypeScript form (copied from source) for quick reference:

    export const AGENT_JSON = {
      "name": "ability-vision",
      "type": "ability",
      "version": "0.1.0",
      "description": "Vision analysis ability - image understanding via multimodal LLMs (Claude, GPT-4V)",
      "entrypoint": "dist/index.js",
      "scripts": {
        "preflight": "node --version",
        "setup": "npm install && npm run build",
        "build": "npx tsc",
        "start": "node dist/index.js",
        "dev": "npx tsx index.ts",
        "serve": "npx tsx index.ts stdio",
        "serve:broker": "npx tsx index.ts broker",
        "clean": "rm -rf node_modules abilities agent-lock.json package-lock.json dist"
      },
      "abilities": {
        "secret-ability": "*"
      }
    } as const;

Runtime flags / scripts:
- npm run dev: run index.ts via tsx (development).
- npm run serve: run index.ts in stdio mode.
- npm run serve:broker: run index.ts in broker mode.
- npm run build: runs npx tsc, which produces dist/index.js, the runtime entrypoint.
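The serve scripts pass `stdio` or `broker` as the first argument to index.ts. How index.ts actually interprets that argument is not shown in the source, so the following is only a sketch of one plausible approach; `parseTransport` and the `Transport` type are hypothetical names, and the fallback used when no argument is given is left to the caller.

```typescript
type Transport = "stdio" | "broker";

// Interprets the first CLI argument as a transport mode.
// `fallback` is used when no argument is given (as with `npm run dev`).
function parseTransport(args: readonly string[], fallback: Transport): Transport {
  const first = args[0];
  if (first === undefined) {
    return fallback;
  }
  if (first === "stdio" || first === "broker") {
    return first;
  }
  throw new Error(`unknown transport "${first}" (expected "stdio" or "broker")`);
}
```

In a Node entrypoint this would typically be called with `process.argv.slice(2)`, so that `npx tsx index.ts stdio` yields `"stdio"`.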
Code Examples
The repository metadata and config are the canonical configuration artifacts. The TypeScript-wrapped copies shown above are provided for quick reference and for use in tests or type-checked utilities.
- agent.json (copy shown above): use the AGENT_JSON constant to examine the declared dependencies and entrypoint.
- config.toml (copy shown above): use the CONFIG_TOML constant to seed a test config or to parse during unit tests.
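As an illustration of parsing CONFIG_TOML in a unit test, the sketch below implements a deliberately tiny parser for the subset of TOML this file uses: comments, section headers, quoted strings, integers, and arrays of quoted strings. It is not a general TOML parser (no escapes, no commas inside strings, no nested tables); `parseFlatToml` and `parseValue` are hypothetical helpers, and a real project would more likely use a proper TOML library.

```typescript
const CONFIG_TOML = `# Ability Vision Configuration
# Secrets go in secrets.toml (encrypted vault)

[broker.local]
URL = "ws://localhost:8080/kadi"
NETWORKS = ["vision"]
MODE = "native"

[model]
MANAGER_BASE_URL = ""
VISION_MODEL = "claude-sonnet-4-20250514"
MAX_TOKENS = 4096`;

type TomlValue = string | number | string[];
type FlatToml = Record<string, TomlValue>;

// Parses one right-hand side: a quoted string, an integer, or an array of quoted strings.
function parseValue(raw: string): TomlValue {
  if (raw.startsWith("[") && raw.endsWith("]")) {
    const inner = raw.slice(1, -1).trim();
    if (inner === "") {
      return [];
    }
    return inner.split(",").map((item) => {
      const value = parseValue(item.trim());
      if (typeof value !== "string") {
        throw new Error(`unsupported array item: ${item}`);
      }
      return value;
    });
  }
  if (raw.length >= 2 && raw.startsWith('"') && raw.endsWith('"')) {
    return raw.slice(1, -1);
  }
  if (/^-?\d+$/.test(raw)) {
    return Number(raw);
  }
  throw new Error(`unsupported value: ${raw}`);
}

// Flattens the file into "section.key" -> value entries.
function parseFlatToml(source: string): FlatToml {
  const result: FlatToml = {};
  let section = "";
  for (const line of source.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) {
      continue; // skip blank lines and comments
    }
    const header = /^\[([^\]]+)\]$/.exec(trimmed);
    if (header !== null) {
      section = header[1].trim();
      continue;
    }
    const eq = trimmed.indexOf("=");
    if (eq < 0) {
      throw new Error(`cannot parse line: ${trimmed}`);
    }
    const key = trimmed.slice(0, eq).trim();
    const value = parseValue(trimmed.slice(eq + 1).trim());
    result[section === "" ? key : `${section}.${key}`] = value;
  }
  return result;
}
```

With this in place, a test can assert on individual settings, for example that `parseFlatToml(CONFIG_TOML)["model.MAX_TOKENS"]` equals the number 4096.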
If you have built code in dist/index.js, inspect that file to discover exported tool names and signatures. Typical patterns in abilities (what to look for inside dist/index.js):
- initialization that reads config.toml and secrets from the vault
- broker connection logic that uses broker.local.URL and joins broker.local.NETWORKS
- tool/handler registration on the kadi ability runtime (registerTool/registerHandler or equivalent functions from @kadi.build/core)
- calls out to the configured VISION_MODEL via a manager or direct API client
Since dist/index.js is the canonical implementation, open it to copy exact API names for the tools you must call.
Dependencies
From package metadata:
Runtime dependencies:
- @kadi.build/core: * (core Kadi runtime utilities)
- tsx: ^4.21.0 (used in development scripts)
Dev dependencies:
- typescript
- @types/node
Ability-level dependencies (declared in agent.json):
- secret-ability — a required ability that provides access to secrets/vault. ability-vision expects secret-ability to be available and to provide secure retrieval of API keys and tokens required to call multimodal LLM endpoints.
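Because ability-vision expects secret-ability to be available, a startup or test-time check can compare the abilities declared in agent.json with those actually present. The sketch below assumes the caller already knows which abilities are available (how that list is obtained is outside the source); `AgentManifest` and `missingAbilities` are hypothetical names.

```typescript
// Only the manifest fields this check needs.
interface AgentManifest {
  name: string;
  abilities?: Record<string, string>;
}

// Returns the declared abilities that are not in the `available` list.
function missingAbilities(
  manifest: AgentManifest,
  available: readonly string[],
): string[] {
  const declared = Object.keys(manifest.abilities ?? {});
  return declared.filter((name) => available.indexOf(name) === -1);
}

// Manifest excerpt matching the agent.json shown in this document.
const manifest: AgentManifest = {
  name: "ability-vision",
  abilities: { "secret-ability": "*" },
};

// With nothing available, the declared dependency is reported as missing.
const missing = missingAbilities(manifest, []);
```

Here `missing` contains `"secret-ability"`; passing `["secret-ability"]` as the available list would yield an empty array.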
What depends on ability-vision:
- Any agent that requires vision/image analysis should call into the “vision” network on the broker. Those calling agents will be consumers of the tools/handlers exported by ability-vision (names and RPC signatures are defined in dist/index.js).
How to inspect/extend
- Open dist/index.js (or source index.ts if present) to:
  - Find exact tool names and handler signatures that other agents will call.
  - Locate where the code reads [model] configuration (VISION_MODEL, MAX_TOKENS).
  - Locate how secrets are fetched (via secret-ability RPC calls or direct vault client) and adjust to support additional secret keys or providers.
- Update config.toml to change broker URL/network or to point to a model manager.
- Add or modify dependency on secret-ability if you need a different secrets provider.
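When pointing the ability at a model manager, the code has to decide which base URL to use. The sketch below shows one plausible rule, assuming a blank MANAGER_BASE_URL means "call the provider directly"; `resolveModelBaseUrl` and the `directBaseUrl` parameter are hypothetical, and the real selection logic should be checked in dist/index.js.

```typescript
// Picks the manager URL when configured, otherwise the direct provider URL.
// Trailing slashes are stripped so paths can be appended consistently.
function resolveModelBaseUrl(managerBaseUrl: string, directBaseUrl: string): string {
  const manager = managerBaseUrl.trim();
  const chosen = manager !== "" ? manager : directBaseUrl.trim();
  if (chosen === "") {
    throw new Error("no model endpoint configured");
  }
  return chosen.replace(/\/+$/, "");
}
```

For example, with the default blank MANAGER_BASE_URL and a placeholder direct URL of `https://api.example.test/`, the function returns `https://api.example.test`.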
If you need more detailed, per-tool documentation, provide the dist/index.js or the TypeScript source for index.ts so the exact exported tool names and call signatures can be documented verbatim.