Audio-to-Video WebSocket API (Beta)

The Audio-to-Video WebSocket API enables you to stream audio data to drive AI avatars in near real-time by integrating with WebRTC providers. It is designed for server-to-server connections and supports faster-than-real-time speech sources such as conversational frameworks, text-to-speech engines, and speech-to-speech systems.

Beta API

This is a beta API with the following characteristics:

  • Real-time audio streaming
  • WebSocket-based communication
  • Server-to-server connections
  • WebRTC provider integration
  • Event-driven architecture

Suitable Use Cases

  • Operating your own backend voice orchestration stack (e.g., LiveKit Agent, Pipecat, OpenAI Realtime) to control avatar speech.
  • Implementing custom workflows requiring precise control over speech timing and audio input.
  • Integrations that connect with WebRTC networks like LiveKit, Daily, or Agora for low-latency avatar video streaming.

Not Designed For

  • Direct audio input from end-user devices like browsers or mobile apps.

WebSocket Endpoint

The WebSocket URL is returned in the realtime_endpoint field of the response to the POST /v1/streaming/new API call, in the format:

wss://webrtc-signaling.konpro.ai/v1-alpha/interactive-avatar/session/<session_id>
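As a minimal sketch of connecting and reading server events (assuming the third-party websockets Python package and a session_id obtained from POST /v1/streaming/new):

```python
import asyncio
import json

def endpoint_url(session_id: str) -> str:
    # In practice, use the realtime_endpoint value returned by
    # POST /v1/streaming/new verbatim; this just mirrors its format.
    return ("wss://webrtc-signaling.konpro.ai/v1-alpha/"
            f"interactive-avatar/session/{session_id}")

async def listen(session_id: str) -> None:
    import websockets  # third-party: pip install websockets
    async with websockets.connect(endpoint_url(session_id)) as ws:
        async for raw in ws:  # server events arrive as JSON text frames
            event = json.loads(raw)
            print("server event:", event.get("type"))

# asyncio.run(listen("<session_id>"))  # run with a real session_id
```
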

Client Actions (JSON Messages over WebSocket)

agent.speak

Stream base64-encoded 16-bit, 24 kHz PCM audio chunks for avatar speech.

JSON
{
  "type": "agent.speak",
  "event_id": "<event_id>",
  "audio": "<Base64 encoded PCM audio chunk>"
}
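The agent.speak payload can be built as a pure function over raw PCM bytes. In this sketch, the 100 ms chunk size is an illustrative assumption, not an API requirement:

```python
import base64
import json

SAMPLE_RATE = 24_000                 # 16-bit mono PCM at 24 kHz
CHUNK_BYTES = SAMPLE_RATE * 2 // 10  # ~100 ms per chunk (assumed size)

def speak_message(event_id: str, pcm_chunk: bytes) -> str:
    # Wrap one raw PCM chunk as an agent.speak message (audio is base64).
    return json.dumps({
        "type": "agent.speak",
        "event_id": event_id,
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

def speak_messages(event_id: str, pcm: bytes):
    # Split a PCM buffer into fixed-size chunks, one message per chunk.
    for off in range(0, len(pcm), CHUNK_BYTES):
        yield speak_message(event_id, pcm[off:off + CHUNK_BYTES])
```
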

agent.speak_end

Signal the end of speech audio; an optional final audio chunk may be included.

JSON
{
  "type": "agent.speak_end",
  "event_id": "<event_id>",
  "audio": "<optional final base64 audio chunk>"
}

agent.audio_buffer_clear

Discard buffered audio.

JSON
{
  "type": "agent.audio_buffer_clear",
  "event_id": "<event_id>"
}

agent.interrupt

Abort current and queued speech tasks.

JSON
{
  "type": "agent.interrupt",
  "event_id": "<event_id>"
}
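A common orchestration pattern (an assumption on our part, not mandated by the API) is to pair these two actions on user barge-in: drop anything still buffered, then abort in-flight speech:

```python
import json

def barge_in_messages(event_id: str) -> list[str]:
    # Clear queued audio first so nothing stale plays, then interrupt
    # the current speech task.
    return [
        json.dumps({"type": "agent.audio_buffer_clear", "event_id": event_id}),
        json.dumps({"type": "agent.interrupt", "event_id": event_id}),
    ]
```
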

agent.start_listening

Trigger avatar's listening animation (only if idle).

JSON
{
  "type": "agent.start_listening",
  "event_id": "<event_id>"
}

agent.stop_listening

Stop listening animation if active.

JSON
{
  "type": "agent.stop_listening",
  "event_id": "<event_id>"
}

session.keep_alive

Reset idle timeout to keep session active.

JSON
{
  "type": "session.keep_alive",
  "event_id": "<event_id>"
}
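A background keep-alive task might look like this sketch; the 15-second interval is an assumed value, chosen to stay comfortably under the session's idle timeout:

```python
import asyncio
import json
import uuid

def keep_alive_message() -> str:
    # Fresh event_id per message; the payload carries no other fields.
    return json.dumps({"type": "session.keep_alive",
                       "event_id": str(uuid.uuid4())})

async def keep_alive_loop(ws, interval_s: float = 15.0) -> None:
    # ws is any object with an async send(); run this alongside your
    # audio-streaming task, e.g. via asyncio.create_task().
    while True:
        await ws.send(keep_alive_message())
        await asyncio.sleep(interval_s)
```
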

Server Events (JSON Messages over WebSocket)

  • session.state_updated: Updates on session lifecycle states (initialized, connecting, connected, disconnecting).
  • agent.audio_buffer_appended: Confirmation of buffered audio acceptance.
  • agent.audio_buffer_committed: Buffered audio finalized for playback.
  • agent.audio_buffer_cleared: Notification audio buffer was cleared.
  • agent.idle_started / agent.idle_ended: Avatar idle state changes.
  • agent.speak_started / agent.speak_ended: Avatar speech task lifecycle events.
  • agent.speak_interrupted: Avatar speech interrupted early.
  • error: Reports request failures with error type, message, and related client event ID.
  • warning: Non-fatal notices such as deprecations.

Architecture Overview

  • Client: Runs WebRTC client SDK (e.g., LiveKit SDK, Pipecat RTVI).
  • Backend: Hosts your application server with WebRTC server SDK.
  • Agent Worker: Manages speech orchestration, feeding audio to KonPro API.
  • KonPro API & Services: Renders avatar video from the streamed audio and delivers it through the WebRTC provider.
  • WebRTC Provider: LiveKit, Daily, Agora, etc.
  • ASR Provider: Automatic speech recognition (Deepgram, Gladia).
  • LLM Provider: Large language models (OpenAI, Gemini).
  • TTS Provider: Text-to-speech systems (ElevenLabs, Cartesia).

Notes

Beta API: This API is in beta; backward compatibility and API contract stability are not guaranteed. Feedback is welcome and helps drive ongoing improvements.
