ability-voice
Voice/audio ability
ability-voice provides on-device voice functionality for the AGENTS platform: speech-to-text (Whisper), text-to-speech (Piper), wake-word detection and continuous/listen modes. It is optimized for NVIDIA Jetson (L4T) and x86 CUDA images and exposes a small set of capabilities (transcribe, synthesize, speak, start_listening, stop_listening, etc.) over the KADI broker’s “voice” network so other agents can request voice operations.
Architecture
- Key components
  - Whisper (STT): offline model selection via WHISPER_MODEL / config.toml [whisper].MODEL. Runs via PyTorch.
  - Piper (TTS): ONNX voice models pre-fetched into /root/.local/share/piper/voices; voice selected by config.toml [piper].VOICE.
  - Wake-word / VAD: wake word string and voice activity detection (VAD) parameters configured via config.toml [wake].
  - Audio I/O: ALSA / PulseAudio / PortAudio for capture and playback; ffmpeg/sndfile used for file I/O and format conversions.
  - KADI broker integration: agent registers on the “voice” network and listens for ability invocations via broker URL(s).
- Data flow (typical operations)
  - Transcribe: audio bytes (from file or stream) → Whisper model → text result → returned over broker.
  - Synthesize: text + voice selection → Piper (ONNX) → WAV audio bytes → returned or played locally.
  - Speak: synthesize then play audio locally (optionally streaming playback).
  - Listening loop: when LISTEN_MODE=1 or start_listening invoked, the agent starts microphone capture → wake-word/VAD logic → chunked transcribe events or single transcribe result.
- How it fits in AGENTS ecosystem
  - Declares network “voice” and capabilities in agent.json so orchestrators and other agents can discover available voice operations.
  - Runs as an “ability” (type: kadi-ability) that other agents call via KADI broker (WSS/WS). It is intended to be a shared service for STT/TTS for edge agents.
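
The listening loop described above has to decide when a recording ends. The following is a minimal sketch of that decision, not the ability's actual implementation. It assumes audio arrives as fixed-length frames that a VAD has already labeled as speech or non-speech; the function name `find_recording_end` and the 30 ms frame length are hypothetical. The two limits mirror the [wake] settings SILENCE_TIMEOUT_MS and MAX_RECORDING_SECONDS.

```python
from typing import Iterable, Optional


def find_recording_end(
    speech_flags: Iterable[bool],
    frame_ms: int = 30,                # assumed frame length, not from the source
    silence_timeout_ms: int = 1500,    # mirrors [wake].SILENCE_TIMEOUT_MS
    max_recording_seconds: int = 30,   # mirrors [wake].MAX_RECORDING_SECONDS
) -> Optional[int]:
    """Return how many frames belong to the recording, or None if it is still open.

    speech_flags holds one VAD decision per frame (True = speech), starting
    with the first frame captured after the wake word.
    """
    silence_ms = 0
    max_frames = (max_recording_seconds * 1000) // frame_ms
    for index, is_speech in enumerate(speech_flags, start=1):
        # Any speech frame resets the trailing-silence counter.
        silence_ms = 0 if is_speech else silence_ms + frame_ms
        if silence_ms >= silence_timeout_ms:
            return index  # trailing silence reached the timeout
        if index >= max_frames:
            return index  # hard cap on recording length
    return None  # input ended before either limit was hit
```

The frames up to the returned index would then be handed to the transcribe step; a `None` result means capture should continue.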
Tools / API
The ability exposes the following capability names (as seen in agent.json metadata). Each capability is callable via the KADI ability invocation pattern on the “voice” network.
| Tool / Capability | Description | Key parameters |
|---|---|---|
| transcribe | Transcribe raw audio bytes (PCM/WAV) using Whisper. | audio (bytes or base64), model (optional; overrides WHISPER_MODEL/config), language (optional) |
| transcribe_file | Transcribe audio from a file path accessible to the agent. | path (string), model (optional) |
| synthesize | Produce audio bytes for given text using Piper TTS. | text (string), voice (optional; defaults to piper.VOICE), sample_rate (optional) |
| speak | Synthesize and play text locally on the device’s audio output. | text (string), voice (optional), interrupt (optional) |
| list_voices | List installed Piper voices (reads /root/.local/share/piper/voices). | (no params) |
| start_listening | Start continuous listen/wake-word mode (non-blocking). Emits events or returns listener handle. | mode (optional; e.g., “wake” or “vad”), silent (optional) |
| stop_listening | Stop a running listening session. | listener_id (or none to stop default) |
| listener_status | Query status of the current listener (running, idle, last_event). | (no params) |
Note: exact request/response envelope follows the KADI ability protocol (ability name: ability-voice, network: voice). The above table documents the semantic surface you can expect this ability to implement.
If tool registration calls are present in the runtime, they will register these capability names with the broker so other agents can discover them.
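
To make the table concrete, here is a minimal Python sketch that builds a `transcribe` request. It is an illustration under assumptions, not the confirmed KADI protocol: the envelope fields follow the representative payloads in the Code Examples section of this page, the parameter names follow the table above, and the helper names `build_invocation` and `build_transcribe_request` are hypothetical.

```python
import base64
import json
import uuid
from typing import Any, Dict, Optional


def build_invocation(
    action: str,
    params: Optional[Dict[str, Any]] = None,
    reply_to: str = "my-agent",  # placeholder agent name
) -> Dict[str, Any]:
    """Wrap a capability call in the (assumed) invocation envelope."""
    return {
        "type": "ability.invoke",
        "ability": "ability-voice",
        "action": action,
        "network": "voice",
        "params": params or {},
        "request_id": f"req-{uuid.uuid4().hex[:8]}",
        "reply_to": reply_to,
    }


def build_transcribe_request(
    audio: bytes,
    model: Optional[str] = None,
    language: Optional[str] = None,
) -> Dict[str, Any]:
    """Base64-encode raw audio and attach only the optional params that were given."""
    params: Dict[str, Any] = {"audio": base64.b64encode(audio).decode("ascii")}
    if model is not None:
        params["model"] = model
    if language is not None:
        params["language"] = language
    return build_invocation("transcribe", params)


# Placeholder bytes stand in for real PCM/WAV data.
request = build_transcribe_request(b"RIFF-placeholder-audio", model="base")
payload = json.dumps(request)  # the JSON text a broker client would send
```

Leaving out unset optional parameters lets the ability fall back to its own defaults (for example WHISPER_MODEL or [whisper].MODEL) instead of receiving explicit nulls.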
Configuration
Configuration is read from config.toml by default; environment variables can override behavior. The important fields and environment variables found in the source (agent.json, config.toml, and deploy blocks) are:
config.toml fields
- [broker.local]
  - URL: “ws://localhost:8080/kadi”
  - NETWORKS: [“voice”]
  - MODE: “native”
- [broker.remote]
  - URL: “wss://broker.dadavidtseng.com/kadi”
  - NETWORKS: [“voice”]
- [whisper]
  - MODEL: “base” (default Whisper model to use for STT)
- [piper]
  - VOICE: “en_US-lessac-medium” (default Piper voice id)
- [wake]
  - WORD: “hey kadi” (wake-word phrase)
  - VAD_AGGRESSIVENESS: 3 (VAD aggressiveness, 0-3)
  - SILENCE_TIMEOUT_MS: 1500 (silence cutoff for recording end)
  - MAX_RECORDING_SECONDS: 30 (maximum single recording length)
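
For orientation, the fields above can be written out as TOML and loaded with Python's standard parser. This is a sketch: the TOML text is reconstructed from the list above, so the real config.toml may be laid out differently, and `load_config` is a hypothetical helper. `tomllib` ships with Python 3.11 and later; on older interpreters the `tomli` package offers the same interface.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # same API on older interpreters (pip install tomli)

# Reconstructed from the documented fields; the real file may differ.
EXAMPLE_CONFIG = """
[broker.local]
URL = "ws://localhost:8080/kadi"
NETWORKS = ["voice"]
MODE = "native"

[broker.remote]
URL = "wss://broker.dadavidtseng.com/kadi"
NETWORKS = ["voice"]

[whisper]
MODEL = "base"

[piper]
VOICE = "en_US-lessac-medium"

[wake]
WORD = "hey kadi"
VAD_AGGRESSIVENESS = 3
SILENCE_TIMEOUT_MS = 1500
MAX_RECORDING_SECONDS = 30
"""


def load_config(text: str) -> dict:
    """Parse TOML text and sanity-check the documented VAD range (0-3)."""
    config = tomllib.loads(text)
    aggressiveness = config["wake"]["VAD_AGGRESSIVENESS"]
    if not 0 <= aggressiveness <= 3:
        raise ValueError(f"VAD_AGGRESSIVENESS must be 0-3, got {aggressiveness}")
    return config


config = load_config(EXAMPLE_CONFIG)
```

To read the file from disk instead, open it in binary mode and pass the handle to `tomllib.load`.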
Important environment variables (from agent.json deploy/scripts)
- KADI_BROKER_URL — broker URL (e.g., wss://broker.dadavidtseng.com/kadi)
- KADI_AGENT_NAME — agent name (ability-voice)
- WHISPER_MODEL — overrides [whisper].MODEL (e.g., base.en)
- LISTEN_MODE — if set (non-empty), starts the agent in continuous listen mode
- STANDALONE — if set, runs without connecting to remote broker (local-only)
- NVIDIA_VISIBLE_DEVICES / NVIDIA_DRIVER_CAPABILITIES — used in container to enable GPU devices
- Other runtime envs from deploy blocks are passed to the container runtime as needed.
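
The precedence between environment variables and config.toml can be sketched as follows. This is an assumption-labeled illustration, not the agent's code: `RuntimeSettings` and `resolve_settings` are hypothetical names, and treating an empty value as "not set" is an assumption for WHISPER_MODEL and STANDALONE (the source only states the non-empty rule for LISTEN_MODE).

```python
import os
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class RuntimeSettings:
    whisper_model: str
    listen_mode: bool
    standalone: bool


def resolve_settings(
    config: Mapping[str, Any],
    environ: Mapping[str, str] = os.environ,
) -> RuntimeSettings:
    """Combine parsed config.toml values with environment overrides."""
    config_model = config.get("whisper", {}).get("MODEL", "base")
    return RuntimeSettings(
        # WHISPER_MODEL overrides [whisper].MODEL when it is set and non-empty.
        whisper_model=environ.get("WHISPER_MODEL") or config_model,
        # LISTEN_MODE: any non-empty value starts continuous listen mode.
        listen_mode=bool(environ.get("LISTEN_MODE")),
        # STANDALONE: assumed to follow the same non-empty rule.
        standalone=bool(environ.get("STANDALONE")),
    )
```

Passing the environment as a parameter keeps the function easy to test, for example `resolve_settings({"whisper": {"MODEL": "base"}}, {"WHISPER_MODEL": "base.en"})`.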
Secrets / Vault
- No explicit secret vault fields exist in the provided source. If you need broker credentials or model access tokens (e.g., Hugging Face tokens), inject them as environment variables or host-mounted credential files and adjust the agent code to read them. The agent supports broker URLs with TLS (wss) as configured in agent.json.
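
If you do add credential handling, one common pattern is to accept either a plain environment variable or a path to a mounted file. The sketch below is hypothetical throughout: `read_secret`, the variable name KADI_BROKER_TOKEN, and the `_FILE` suffix convention are illustrative choices, not something the ability defines.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_secret(name: str, environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the secret from NAME, else from the file named by NAME_FILE, else None."""
    value = environ.get(name)
    if value:
        return value
    file_path = environ.get(f"{name}_FILE")
    if file_path and Path(file_path).is_file():
        # Mounted credential files often end with a newline; strip it.
        return Path(file_path).read_text(encoding="utf-8").strip()
    return None


# Hypothetical variable name, used only for illustration.
broker_token = read_secret("KADI_BROKER_TOKEN")
```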
Build/runtime scripts (from agent.json.scripts)
- preflight: python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')" || python3 --version
- setup: python3 -m venv venv && ./venv/bin/pip install --upgrade pip && ./venv/bin/pip install -r requirements.txt
- start: ./venv/bin/python3 -m ability_voice
- listen: LISTEN_MODE=1 ./venv/bin/python3 -m ability_voice
- standalone: STANDALONE=1 ./venv/bin/python3 -m ability_voice
- test: ./venv/bin/python3 -m pytest tests/
Images / special build considerations
- L4T/PyTorch images are used for Jetson targets (nvcr.io/nvidia/l4t-pytorch:r36.2.0-…).
- Piper voice ONNX files are downloaded in the image build and stored under /root/.local/share/piper/voices.
Code Examples
Below are pragmatic examples showing how other agents or clients typically interact with ability-voice. These are representative usage patterns (connect to broker, invoke capability names listed above). Replace values with your actual broker URL, agent name, and parameters.
Example: invoke transcribe (TypeScript-style example calling a generic WebSocket/KADI broker)

```ts
// Connect to KADI broker and request "transcribe" from ability-voice
import WebSocket from "ws";

const brokerUrl = "wss://broker.dadavidtseng.com/kadi";
const ws = new WebSocket(brokerUrl);

ws.on("open", () => {
  // Example KADI-style ability invocation envelope (adjust to your broker's API)
  const request = {
    type: "ability.invoke",
    ability: "ability-voice",
    action: "transcribe",
    network: "voice",
    params: {
      // base64 audio payload; the key is named "audio" to match the capability table
      audio: "<BASE64_PCM_OR_WAV>",
      model: "base"
    },
    request_id: "req-1234",
    reply_to: "my-agent-name"
  };
  ws.send(JSON.stringify(request));
});

ws.on("message", (msg) => {
  const resp = JSON.parse(msg.toString());
  console.log("response:", resp);
});
```

Example: request a TTS synthesis and then play via a downstream audio player

```ts
// Request synthesis (uses the open `ws` connection from the transcribe example)
const synthRequest = {
  type: "ability.invoke",
  ability: "ability-voice",
  action: "synthesize",
  network: "voice",
  params: { text: "Hello from KADI", voice: "en_US-lessac-medium" },
  request_id: "synth-1",
  reply_to: "my-agent"
};
ws.send(JSON.stringify(synthRequest));

// Expect response with audio bytes (base64). Then decode and stream to playback subsystem.
```

Example: start and stop the listener

```ts
// Start listening (wake-word mode)
ws.send(JSON.stringify({
  type: "ability.invoke",
  ability: "ability-voice",
  action: "start_listening",
  network: "voice",
  params: { mode: "wake" },
  request_id: "listen-1",
  reply_to: "my-agent"
}));

// Later, stop
ws.send(JSON.stringify({
  type: "ability.invoke",
  ability: "ability-voice",
  action: "stop_listening",
  network: "voice",
  request_id: "listen-stop",
  reply_to: "my-agent"
}));
```

Note: The exact broker message format depends on your KADI broker client library. The above payloads use a straightforward envelope: {type, ability, action, network, params, request_id, reply_to}. Adapt them to your project’s KADI SDK.
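
On the receiving side, a synthesize response has to be decoded before playback. The Python sketch below shows one way to do that with the standard library only. It assumes the response carries a base64-encoded WAV under a key named `audio`; that key, the helper names, and the 22050 Hz sample rate of the stand-in clip are illustrative assumptions, not confirmed details of the ability.

```python
import base64
import io
import wave
from typing import Any, Dict, Tuple


def decode_wav_response(response: Dict[str, Any]) -> Tuple[bytes, int, int]:
    """Return (pcm_frames, sample_rate, channels) from a base64 WAV payload."""
    wav_bytes = base64.b64decode(response["audio"])  # "audio" key is assumed
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.readframes(wav.getnframes()), wav.getframerate(), wav.getnchannels()


def make_silent_wav(sample_rate: int = 22050, duration_ms: int = 100) -> bytes:
    """Build a short mono 16-bit silent WAV that stands in for real TTS output."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * (sample_rate * duration_ms // 1000))
    return buffer.getvalue()


# Simulated response; a real one would arrive from the broker.
fake_response = {"audio": base64.b64encode(make_silent_wav()).decode("ascii")}
frames, rate, channels = decode_wav_response(fake_response)
```

The returned sample rate and channel count are what a playback library needs to open an output stream for the raw frames.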
Also see agent.json scripts to run the ability locally:

```bash
# start normally
./venv/bin/python3 -m ability_voice

# start in listen mode
LISTEN_MODE=1 ./venv/bin/python3 -m ability_voice

# start standalone (no remote broker)
STANDALONE=1 ./venv/bin/python3 -m ability_voice
```

Dependencies
- Abilities
  - secret-ability: ^0.4.1 (declared under “abilities” in agent.json)
- System packages (installed in image builds)
  - python3-pyaudio, portaudio19-dev, libsndfile1, ffmpeg, espeak-ng, libespeak-ng-dev, git, cmake, build-essential, alsa-utils, pulseaudio-utils
- Python / ML libraries (installed via requirements.txt / pip)
  - PyTorch (CUDA-capable on Jetson/x86 images used), Whisper-related packages, Piper runtime/onnx inference runtime
- ONNX voice models
  - en_US-lessac-medium.onnx and its .json metadata are downloaded during image build into /root/.local/share/piper/voices
- Container base images
  - Jetson: nvcr.io/nvidia/l4t-pytorch:r36.2.0-pth2.3-py3
  - x86: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
What depends on ability-voice
- Any agent on the AGENTS platform that requires speech capabilities (transcribe, speak, synthesize) should call this ability over the “voice” network.
- Higher-level voice assistants, dialog managers, or orchestrators can use start_listening/stop_listening to control microphone sessions and subscribe to transcription events.
If you need to modify the behavior:
- Adjust config.toml [whisper]/[piper]/[wake] fields for model and detection behavior.
- Add or replace Piper ONNX voice files in /root/.local/share/piper/voices in the build step.
- For broker/agent integration changes, inspect src/ability_voice/main.py (entrypoint) and ability registration code to see exact message envelope and event emission hooks.
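
After adding or replacing voice files, it helps to check what the ability would be able to see. The sketch below scans the voices directory in the way a `list_voices` capability might; it is not the ability's implementation. It assumes each voice consists of a `<voice>.onnx` model with a sibling `<voice>.onnx.json` metadata file, which is the usual Piper layout, and the function body is hypothetical.

```python
from pathlib import Path
from typing import List

DEFAULT_VOICE_DIR = Path("/root/.local/share/piper/voices")


def list_voices(voice_dir: Path = DEFAULT_VOICE_DIR) -> List[str]:
    """Return voice ids for every .onnx model that has a matching .onnx.json file."""
    if not voice_dir.is_dir():
        return []
    voices = []
    for model_path in sorted(voice_dir.glob("*.onnx")):
        # Assumed layout: metadata sits next to the model as "<name>.onnx.json".
        metadata_path = model_path.with_name(model_path.name + ".json")
        if metadata_path.is_file():
            voices.append(model_path.stem)
    return voices
```

A model file without its metadata is skipped here, on the assumption that the metadata is needed to load the voice.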