Audio-to-Video WebSocket API (Beta)

The Audio-to-Video WebSocket API enables you to stream audio data to drive AI avatars in near real-time by integrating with WebRTC providers. It is designed for server-to-server connections and supports faster-than-real-time speech sources such as conversational frameworks, text-to-speech engines, and speech-to-speech systems.

Beta API

This is a beta API with the following characteristics:

  • Real-time audio streaming
  • WebSocket-based communication
  • Server-to-server connections
  • WebRTC provider integration
  • Event-driven architecture

Suitable Use Cases

  • Operating your own backend voice orchestration stack (e.g., LiveKit Agent, Pipecat, OpenAI Realtime) to control avatar speech.
  • Implementing custom workflows requiring precise control over speech timing and audio input.
  • Integrations that connect with WebRTC networks like LiveKit, Daily, or Agora for low-latency avatar video streaming.

Not Designed For

  • Direct audio input from end-user devices like browsers or mobile apps.

WebSocket Endpoint

The WebSocket URL is returned in the realtime_endpoint field of the response to the POST /v1/streaming/new API call, in the format:

wss://webrtc-signaling.konpro.ai/v1-alpha/interactive-avatar/session/<session_id>
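As a minimal sketch of connecting and reading server events (assuming the third-party websockets Python package and a session_id obtained from POST /v1/streaming/new):

```python
import asyncio
import json

def endpoint_url(session_id: str) -> str:
    # In practice, use the realtime_endpoint value returned by
    # POST /v1/streaming/new verbatim; this just mirrors its format.
    return ("wss://webrtc-signaling.konpro.ai/v1-alpha/"
            f"interactive-avatar/session/{session_id}")

async def listen(session_id: str) -> None:
    import websockets  # third-party: pip install websockets
    async with websockets.connect(endpoint_url(session_id)) as ws:
        async for raw in ws:  # server events arrive as JSON text frames
            event = json.loads(raw)
            print("server event:", event.get("type"))

# asyncio.run(listen("<session_id>"))  # run with a real session_id
```
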

Client Actions (JSON Messages over WebSocket)

agent.speak

Stream base64-encoded 16-bit, 24 kHz PCM audio chunks for avatar speech.

JSON
{
  "type": "agent.speak",
  "event_id": "<event_id>",
  "audio": "<Base64 encoded PCM audio chunk>"
}
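The agent.speak payload can be built as a pure function over raw PCM bytes. In this sketch, the 100 ms chunk size is an illustrative assumption, not an API requirement:

```python
import base64
import json

SAMPLE_RATE = 24_000                 # 16-bit mono PCM at 24 kHz
CHUNK_BYTES = SAMPLE_RATE * 2 // 10  # ~100 ms per chunk (assumed size)

def speak_message(event_id: str, pcm_chunk: bytes) -> str:
    # Wrap one raw PCM chunk as an agent.speak message (audio is base64).
    return json.dumps({
        "type": "agent.speak",
        "event_id": event_id,
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

def speak_messages(event_id: str, pcm: bytes):
    # Split a PCM buffer into fixed-size chunks, one message per chunk.
    for off in range(0, len(pcm), CHUNK_BYTES):
        yield speak_message(event_id, pcm[off:off + CHUNK_BYTES])
```
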

agent.speak_end

Signal the end of speech audio; an optional final audio chunk may be included.

JSON
{
  "type": "agent.speak_end",
  "event_id": "<event_id>",
  "audio": "<optional final base64 audio chunk>"
}

agent.audio_buffer_clear

Discard buffered audio.

JSON
{
  "type": "agent.audio_buffer_clear",
  "event_id": "<event_id>"
}

agent.interrupt

Abort current and queued speech tasks.

JSON
{
  "type": "agent.interrupt",
  "event_id": "<event_id>"
}
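A common orchestration pattern (an assumption on our part, not mandated by the API) is to pair these two actions on user barge-in: drop anything still buffered, then abort in-flight speech:

```python
import json

def barge_in_messages(event_id: str) -> list[str]:
    # Clear queued audio first so nothing stale plays, then interrupt
    # the current speech task.
    return [
        json.dumps({"type": "agent.audio_buffer_clear", "event_id": event_id}),
        json.dumps({"type": "agent.interrupt", "event_id": event_id}),
    ]
```
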

agent.start_listening

Trigger avatar's listening animation (only if idle).

JSON
{
  "type": "agent.start_listening",
  "event_id": "<event_id>"
}

agent.stop_listening

Stop listening animation if active.

JSON
{
  "type": "agent.stop_listening",
  "event_id": "<event_id>"
}

session.keep_alive

Reset idle timeout to keep session active.

JSON
{
  "type": "session.keep_alive",
  "event_id": "<event_id>"
}
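A background keep-alive task might look like this sketch; the 15-second interval is an assumed value, chosen to stay comfortably under the session's idle timeout:

```python
import asyncio
import json
import uuid

def keep_alive_message() -> str:
    # Fresh event_id per message; the payload carries no other fields.
    return json.dumps({"type": "session.keep_alive",
                       "event_id": str(uuid.uuid4())})

async def keep_alive_loop(ws, interval_s: float = 15.0) -> None:
    # ws is any object with an async send(); run this alongside your
    # audio-streaming task, e.g. via asyncio.create_task().
    while True:
        await ws.send(keep_alive_message())
        await asyncio.sleep(interval_s)
```
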

Server Events (JSON Messages over WebSocket)

  • session.state_updated: Updates on session lifecycle states (initialized, connecting, connected, disconnecting).
  • agent.audio_buffer_appended: Confirmation of buffered audio acceptance.
  • agent.audio_buffer_committed: Buffered audio finalized for playback.
  • agent.audio_buffer_cleared: Notification audio buffer was cleared.
  • agent.idle_started / agent.idle_ended: Avatar idle state changes.
  • agent.speak_started / agent.speak_ended: Avatar speech task lifecycle events.
  • agent.speak_interrupted: Avatar speech interrupted early.
  • error: Reports request failures with error type, message, and related client event ID.
  • warning: Non-fatal notices such as deprecations.

Architecture Overview

  • Client: Runs WebRTC client SDK (e.g., LiveKit SDK, Pipecat RTVI).
  • Backend: Hosts your application server with WebRTC server SDK.
  • Agent Worker: Manages speech orchestration, feeding audio to KonPro API.
  • KonPro API & Services: Renders avatar video from the streamed audio and delivers it through the WebRTC provider.
  • WebRTC Provider: LiveKit, Daily, Agora, etc.
  • ASR Provider: Automatic speech recognition (Deepgram, Gladia).
  • LLM Provider: Large language models (OpenAI, Gemini).
  • TTS Provider: Text-to-speech systems (ElevenLabs, Cartesia).

Notes

Beta API: This API is in beta; backward compatibility and API contract stability are not guaranteed. Feedback is welcome and helps drive ongoing improvements.
