ability-voice

Voice/audio ability

ability-voice provides on-device voice functionality for the AGENTS platform: speech-to-text (Whisper), text-to-speech (Piper), wake-word detection, and a continuous listen mode. It is optimized for NVIDIA Jetson (L4T) and x86 CUDA images and exposes a small set of capabilities (transcribe, synthesize, speak, start_listening, stop_listening, etc.) over the KADI broker’s “voice” network so other agents can request voice operations.

Architecture

Key components
- Whisper (STT): offline model selection via WHISPER_MODEL / config.toml [whisper].MODEL. Runs via PyTorch.
- Piper (TTS): ONNX voice models pre-fetched into /root/.local/share/piper/voices; voice selected by config.toml [piper].VOICE.
- Wake-word / VAD: wake word string and voice activity detection (VAD) parameters configured via config.toml [wake].
- Audio I/O: ALSA / PulseAudio / PortAudio for capture and playback; ffmpeg/sndfile used for file I/O and format conversions.
- KADI broker integration: agent registers on the “voice” network and listens for ability invocations via broker URL(s).
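
The wake word is configured as a plain string, which suggests it can be matched against transcribed text. The following is a minimal sketch under that assumption; the ability itself may use a dedicated wake-word engine instead. The helpers `words` and `stripWakeWord` are hypothetical names for illustration and are not part of ability-voice.

```typescript
// Hypothetical helper: lower-case the text, drop punctuation, split into words.
function words(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ") // "Hey, KADI!" becomes "hey  kadi "
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

// Returns the command text that follows the wake word, or null when the
// wake word does not occur as a whole-word sequence in the transcript.
function stripWakeWord(transcript: string, wakeWord: string): string | null {
  const heard = words(transcript);
  const wake = words(wakeWord);
  if (wake.length === 0) return null;
  for (let i = 0; i + wake.length <= heard.length; i++) {
    const matches = wake.every((w, j) => heard[i + j] === w);
    if (matches) return heard.slice(i + wake.length).join(" ");
  }
  return null;
}
```

Matching on whole words rather than substrings avoids false triggers when the wake phrase happens to appear inside longer words.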

Data flow (typical operations)
- Transcribe: audio bytes (from file or stream) → Whisper model → text result → returned over broker.
- Synthesize: text + voice selection → Piper (ONNX) → WAV audio bytes → returned or played locally.
- Speak: synthesize then play audio locally (optionally streaming playback).
- Listening loop: when LISTEN_MODE=1 or start_listening invoked, the agent starts microphone capture → wake-word/VAD logic → chunked transcribe events or single transcribe result.
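
To make the listening loop more concrete, here is a rough sketch of how the end of a recording could be decided from per-frame VAD decisions, using the [wake] settings SILENCE_TIMEOUT_MS and MAX_RECORDING_SECONDS. This is an illustration of the idea, not the ability's actual implementation; `findRecordingEnd`, `WakeConfig`, and the frame duration parameter are assumptions introduced here.

```typescript
interface WakeConfig {
  silenceTimeoutMs: number;    // mirrors [wake].SILENCE_TIMEOUT_MS
  maxRecordingSeconds: number; // mirrors [wake].MAX_RECORDING_SECONDS
}

type StopReason = "silence" | "max_length" | "input_ended";

// frames: one boolean per audio frame, true when the VAD flagged speech.
// frameMs: duration of each frame in milliseconds (an assumed parameter).
function findRecordingEnd(
  frames: boolean[],
  frameMs: number,
  config: WakeConfig,
): { frameCount: number; reason: StopReason } {
  const maxMs = config.maxRecordingSeconds * 1000;
  let silentMs = 0;
  let heardSpeech = false;
  for (let i = 0; i < frames.length; i++) {
    const elapsedMs = (i + 1) * frameMs;
    if (frames[i]) {
      heardSpeech = true;
      silentMs = 0; // any speech frame resets the silence counter
    } else if (heardSpeech) {
      silentMs += frameMs; // only count silence that follows speech
    }
    if (heardSpeech && silentMs >= config.silenceTimeoutMs) {
      return { frameCount: i + 1, reason: "silence" };
    }
    if (elapsedMs >= maxMs) {
      return { frameCount: i + 1, reason: "max_length" };
    }
  }
  return { frameCount: frames.length, reason: "input_ended" };
}
```

The frames up to `frameCount` would then be handed to the transcribe step, either as one recording or in chunks.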

How it fits in the AGENTS ecosystem
- Declares network “voice” and capabilities in agent.json so orchestrators and other agents can discover available voice operations.
- Runs as an “ability” (type: kadi-ability) that other agents call via KADI broker (WSS/WS). It is intended to be a shared service for STT/TTS for edge agents.

Tools / API

The ability exposes the following capability names (as seen in agent.json metadata). Each capability is callable via the KADI ability invocation pattern on the “voice” network.

Tool / Capability	Description	Key parameters
transcribe	Transcribes raw audio bytes (PCM/WAV) using Whisper.	audio (bytes or base64), model (optional; overrides WHISPER_MODEL/config), language (optional)
transcribe_file	Transcribe audio from a file path accessible to the agent.	path (string), model (optional)
synthesize	Produce audio bytes for given text using Piper TTS.	text (string), voice (optional; defaults to piper.VOICE), sample_rate (optional)
speak	Synthesize and play text locally on the device’s audio output.	text (string), voice (optional), interrupt (optional)
list_voices	List installed Piper voices (reads /root/.local/share/piper/voices).	(no params)
start_listening	Start continuous listen/wake-word mode (non-blocking). Emits events or returns listener handle.	mode (optional; e.g., “wake” or “vad”), silent (optional)
stop_listening	Stop a running listening session.	listener_id (or none to stop default)
listener_status	Query status of the current listener (running, idle, last_event).	(no params)

Note: exact request/response envelope follows the KADI ability protocol (ability name: ability-voice, network: voice). The above table documents the semantic surface you can expect this ability to implement.
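
As a small aid for callers, the capability names from the table can be captured in a type so that typos are caught at compile time. The sketch below assumes the illustrative envelope used in the Code Examples section of this README (type, ability, action, network, params, request_id, reply_to); that shape is not a confirmed KADI wire format, and `buildInvoke` is a hypothetical helper.

```typescript
// Capability names copied from the table above.
type Capability =
  | "transcribe"
  | "transcribe_file"
  | "synthesize"
  | "speak"
  | "list_voices"
  | "start_listening"
  | "stop_listening"
  | "listener_status";

// Assumed envelope shape; adapt it to your KADI client library.
interface InvokeRequest {
  type: "ability.invoke";
  ability: "ability-voice";
  network: "voice";
  action: Capability;
  params: Record<string, unknown>;
  request_id: string;
  reply_to: string;
}

let nextId = 0;

// Builds one request with a fresh request_id on every call.
function buildInvoke(
  action: Capability,
  params: Record<string, unknown>,
  replyTo: string,
): InvokeRequest {
  nextId += 1;
  return {
    type: "ability.invoke",
    ability: "ability-voice",
    network: "voice",
    action,
    params,
    request_id: `req-${nextId}`,
    reply_to: replyTo,
  };
}
```

For example, `buildInvoke("speak", { text: "Hello", interrupt: true }, "my-agent")` produces an object that can be passed to `JSON.stringify` and sent over the broker connection.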

If tool registration calls are present in the runtime, they will register these capability names with the broker so other agents can discover them.

Configuration

Configuration is read from config.toml by default; environment variables can override individual settings. The important fields and environment variables found in the source (agent.json, config.toml, and the deploy blocks) are listed below.

config.toml fields

[broker.local]
- URL: "ws://localhost:8080/kadi"
- NETWORKS: ["voice"]
- MODE: "native"
[broker.remote]
- URL: "wss://broker.dadavidtseng.com/kadi"
- NETWORKS: ["voice"]
[whisper]
- MODEL: "base" (default Whisper model to use for STT)
[piper]
- VOICE: "en_US-lessac-medium" (default Piper voice id)
[wake]
- WORD: "hey kadi" (wake-word phrase)
- VAD_AGGRESSIVENESS: 3 (VAD aggressiveness, 0-3)
- SILENCE_TIMEOUT_MS: 1500 (silence duration, in milliseconds, that ends a recording)
- MAX_RECORDING_SECONDS: 30 (maximum length of a single recording, in seconds)

Important environment variables (from agent.json deploy/scripts)

KADI_BROKER_URL: broker URL (e.g., wss://broker.dadavidtseng.com/kadi)
KADI_AGENT_NAME: agent name (ability-voice)
WHISPER_MODEL: overrides [whisper].MODEL (e.g., base.en)
LISTEN_MODE: if set (non-empty), starts the agent in continuous listen mode
STANDALONE: if set, runs without connecting to the remote broker (local-only)
NVIDIA_VISIBLE_DEVICES / NVIDIA_DRIVER_CAPABILITIES: used in the container to enable GPU devices
Other runtime envs from deploy blocks are passed to the container runtime as needed.
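
The override rules above can be summarized in a few lines. The sketch below is an illustration of the precedence only, written in TypeScript to match the client examples in this README; the ability itself is a Python module and its actual parsing code is not shown here. `resolveSettings` and `RuntimeSettings` are hypothetical names, and treating an empty WHISPER_MODEL or STANDALONE value as unset is an assumption.

```typescript
type Env = Record<string, string | undefined>;

interface RuntimeSettings {
  whisperModel: string;
  listenMode: boolean;
  standalone: boolean;
}

// "Set" here means present and non-empty, as described for LISTEN_MODE.
function isSet(value: string | undefined): boolean {
  return value !== undefined && value !== "";
}

// configWhisperModel is the value of [whisper].MODEL from config.toml.
function resolveSettings(env: Env, configWhisperModel: string): RuntimeSettings {
  const override = env["WHISPER_MODEL"];
  return {
    // The environment variable wins over config.toml when it is set.
    whisperModel:
      override !== undefined && override !== "" ? override : configWhisperModel,
    listenMode: isSet(env["LISTEN_MODE"]),
    standalone: isSet(env["STANDALONE"]),
  };
}
```

In a Node.js process, `process.env` can be passed as the first argument.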

Secrets / Vault

No explicit secret vault fields exist in the provided source. If you need broker credentials or model access tokens (e.g., Hugging Face tokens), inject them as environment variables or host-mounted credential files and adjust the agent code to read them. The agent supports broker URLs with TLS (wss) as configured in agent.json.

Build/runtime scripts (from agent.json.scripts)

preflight: python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')" || python3 --version
setup: python3 -m venv venv && ./venv/bin/pip install --upgrade pip && ./venv/bin/pip install -r requirements.txt
start: ./venv/bin/python3 -m ability_voice
listen: LISTEN_MODE=1 ./venv/bin/python3 -m ability_voice
standalone: STANDALONE=1 ./venv/bin/python3 -m ability_voice
test: ./venv/bin/python3 -m pytest tests/

Images / special build considerations

L4T/PyTorch images are used for Jetson targets (nvcr.io/nvidia/l4t-pytorch:r36.2.0-…).
Piper voice ONNX files are downloaded in the image build and stored under /root/.local/share/piper/voices.

Code Examples

Below are pragmatic examples showing how other agents or clients typically interact with ability-voice. These are representative usage patterns (connect to broker, invoke capability names listed above). Replace values with your actual broker URL, agent name, and parameters.

Example: invoke transcribe (TypeScript-style example calling a generic WebSocket/KADI broker)

// Connect to KADI broker and request "transcribe" from ability-voice
import WebSocket from "ws";

const brokerUrl = "wss://broker.dadavidtseng.com/kadi";
const ws = new WebSocket(brokerUrl);

ws.on("open", () => {
  // Example KADI-style ability invocation envelope (adjust to your broker's API)
  const request = {
    type: "ability.invoke",
    ability: "ability-voice",
    action: "transcribe",
    network: "voice",
    params: {
      // base64-encoded PCM or WAV payload (parameter named "audio" in the capability table)
      audio: "<BASE64_PCM_OR_WAV>",
      model: "base"
    },
    request_id: "req-1234",
    reply_to: "my-agent-name"
  };
  ws.send(JSON.stringify(request));
});

ws.on("message", (msg) => {
  const resp = JSON.parse(msg.toString());
  console.log("response:", resp);
});

Example: request a TTS synthesis and then play via a downstream audio player

// Request synthesis (reuses the `ws` connection from the transcribe example; send only after "open")
const synthRequest = {
  type: "ability.invoke",
  ability: "ability-voice",
  action: "synthesize",
  network: "voice",
  params: {
    text: "Hello from KADI",
    voice: "en_US-lessac-medium"
  },
  request_id: "synth-1",
  reply_to: "my-agent"
};
ws.send(JSON.stringify(synthRequest));

// Expect response with audio bytes (base64). Then decode and stream to playback subsystem.
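
Before handing synthesized audio to a player, it can help to confirm that the decoded bytes really are WAV and to read the sample rate from the header. The sketch below assumes you have already decoded the base64 payload into a Uint8Array (in Node.js, `Buffer.from(b64, "base64")` yields one) and that the audio uses the common 44-byte PCM WAV header layout. The response field that carries the audio is not documented here, and `readWavInfo` is a hypothetical helper.

```typescript
interface WavInfo {
  channels: number;
  sampleRate: number;
  bitsPerSample: number;
}

// Reads a 4-character ASCII chunk tag such as "RIFF" at the given offset.
function tag(bytes: Uint8Array, offset: number): string {
  let out = "";
  for (let i = 0; i < 4; i++) {
    out += String.fromCharCode(bytes[offset + i] ?? 0);
  }
  return out;
}

// Parses the canonical PCM WAV header; returns null for anything else.
function readWavInfo(bytes: Uint8Array): WavInfo | null {
  if (bytes.length < 44) return null;
  if (tag(bytes, 0) !== "RIFF" || tag(bytes, 8) !== "WAVE" || tag(bytes, 12) !== "fmt ") {
    return null;
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    channels: view.getUint16(22, true),      // little-endian fields
    sampleRate: view.getUint32(24, true),
    bitsPerSample: view.getUint16(34, true),
  };
}
```

A null result means the payload is either not WAV or uses a header layout with extra chunks, in which case a full audio library is the safer choice.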

Example: start and stop the listener

// Start listening (wake-word mode); reuses the open `ws` connection from the first example
ws.send(JSON.stringify({
  type: "ability.invoke",
  ability: "ability-voice",
  action: "start_listening",
  network: "voice",
  params: { mode: "wake" },
  request_id: "listen-1",
  reply_to: "my-agent"
}));

// Later, stop
ws.send(JSON.stringify({
  type: "ability.invoke",
  ability: "ability-voice",
  action: "stop_listening",
  network: "voice",
  request_id: "listen-stop",
  reply_to: "my-agent"
}));

Note: The exact broker message format depends on your KADI broker client library. The payloads above use a straightforward envelope: {type, ability, action, network, params, request_id, reply_to}. Adapt them to your project’s KADI SDK.
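
When several requests share one connection, replies need to be matched back to the request that caused them. The sketch below shows one way to do that by request_id. It assumes the broker echoes request_id in each reply, which this README does not confirm; `PendingRequests` is a hypothetical helper, not part of any KADI SDK.

```typescript
type ResponseHandler = (response: Record<string, unknown>) => void;

class PendingRequests {
  private readonly handlers = new Map<string, ResponseHandler>();

  // Remember which handler should receive the reply for a request_id.
  expect(requestId: string, handler: ResponseHandler): void {
    this.handlers.set(requestId, handler);
  }

  // Feed every raw broker message here; returns true when a handler consumed it.
  dispatch(rawMessage: string): boolean {
    let parsed: unknown;
    try {
      parsed = JSON.parse(rawMessage);
    } catch {
      return false; // not JSON, ignore
    }
    if (typeof parsed !== "object" || parsed === null) return false;
    const response = parsed as Record<string, unknown>;
    const requestId = response["request_id"];
    if (typeof requestId !== "string") return false;
    const handler = this.handlers.get(requestId);
    if (handler === undefined) return false;
    this.handlers.delete(requestId); // one reply per request in this sketch
    handler(response);
    return true;
  }
}
```

With the `ws` client from the examples above, this would be wired up as `ws.on("message", (msg) => pending.dispatch(msg.toString()))`. Streaming events from start_listening would need a handler that is not removed after the first message.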

Also see agent.json scripts to run the ability locally:

# start normally
./venv/bin/python3 -m ability_voice

# start in listen mode
LISTEN_MODE=1 ./venv/bin/python3 -m ability_voice

# start standalone (no remote broker)
STANDALONE=1 ./venv/bin/python3 -m ability_voice

Dependencies

Abilities
- secret-ability: ^0.4.1 (declared under “abilities” in agent.json)

System packages (installed in image builds)
- python3-pyaudio, portaudio19-dev, libsndfile1, ffmpeg, espeak-ng, libespeak-ng-dev, git, cmake, build-essential, alsa-utils, pulseaudio-utils

Python / ML libraries (installed via requirements.txt / pip)
- PyTorch (CUDA-enabled builds on the Jetson and x86 images), Whisper-related packages, the Piper runtime, and an ONNX inference runtime

ONNX voice models
- en_US-lessac-medium.onnx and its .json metadata are downloaded during image build into /root/.local/share/piper/voices

Container base images
- Jetson: nvcr.io/nvidia/l4t-pytorch:r36.2.0-pth2.3-py3
- x86: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime

What depends on ability-voice

Any agent on the AGENTS platform that requires speech capabilities (transcribe, speak, synthesize) should call this ability over the “voice” network.
Higher-level voice assistants, dialog managers, or orchestrators can use start_listening/stop_listening to control microphone sessions and subscribe to transcription events.

If you need to modify the behavior:

Adjust config.toml [whisper]/[piper]/[wake] fields for model and detection behavior.
Add or replace Piper ONNX voice files in /root/.local/share/piper/voices in the build step.
For broker/agent integration changes, inspect src/ability_voice/main.py (entrypoint) and ability registration code to see exact message envelope and event emission hooks.
