Audio-to-Video WebSocket API (Beta)
The Audio-to-Video WebSocket API enables you to stream audio data to drive AI avatars in near real-time by integrating with WebRTC providers. It is designed for server-to-server connections and supports faster-than-real-time speech sources such as conversational frameworks, text-to-speech engines, and speech-to-speech systems.
Beta API
This is a beta API with the following characteristics:
- Real-time audio streaming
- WebSocket-based communication
- Server-to-server connections
- WebRTC provider integration
- Event-driven architecture
Suitable Use Cases
- Operating your own backend voice orchestration stack (e.g., LiveKit Agent, Pipecat, OpenAI Realtime) to control avatar speech.
- Implementing custom workflows requiring precise control over speech timing and audio input.
- Integrations that connect with WebRTC networks like LiveKit, Daily, or Agora for low-latency avatar video streaming.
Not Designed For
- Direct audio input from end-user devices like browsers or mobile apps.
WebSocket Endpoint
The WebSocket URL is returned in the realtime_endpoint field of the POST /v1/streaming/new response and has the following format:
wss://webrtc-signaling.konpro.ai/v1-alpha/interactive-avatar/session/<session_id>
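A minimal connection sketch in Python, assuming the requests and websockets packages; the base URL, the x-api-key header name, and the request body are illustrative assumptions, not confirmed parameters:

import asyncio
import requests
import websockets

API_KEY = "<your-api-key>"  # assumed header-based auth; adjust to your account

def create_session() -> str:
    # Create a streaming session; the realtime_endpoint field of the response
    # is the WebSocket URL described above.
    resp = requests.post(
        "https://api.konpro.ai/v1/streaming/new",  # assumed base URL
        headers={"x-api-key": API_KEY},            # assumed auth header name
        json={},                                   # session options, if any
    )
    resp.raise_for_status()
    return resp.json()["realtime_endpoint"]        # adjust if nested in a wrapper object

async def main():
    realtime_endpoint = create_session()
    async with websockets.connect(realtime_endpoint) as ws:
        # ... send the client actions documented below ...
        pass

asyncio.run(main())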
Client Actions (JSON Messages over WebSocket)
agent.speak
Streams base64-encoded 16-bit, 24 kHz PCM audio chunks for avatar speech.
{
"type": "agent.speak",
"event_id": "<event_id>",
"audio": "<Base64 encoded PCM audio chunk>"
}
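A sketch of streaming a PCM buffer over an already-open connection ws (see the connection sketch above); the 200 ms chunk size and UUID event IDs are illustrative choices, not requirements. A matching agent.speak_end sketch follows the next action.

import base64
import json
import uuid

CHUNK_BYTES = 9600  # 200 ms of 16-bit mono PCM at 24 kHz: 24000 samples/s * 2 bytes * 0.2 s

async def speak_pcm(ws, pcm_bytes: bytes) -> None:
    # Slice the PCM buffer and send each slice as an agent.speak message
    # with a base64-encoded payload.
    for offset in range(0, len(pcm_bytes), CHUNK_BYTES):
        chunk = pcm_bytes[offset:offset + CHUNK_BYTES]
        await ws.send(json.dumps({
            "type": "agent.speak",
            "event_id": str(uuid.uuid4()),
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))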
agent.speak_end
Signals the end of the speech audio; an optional final audio chunk may be included.
{
"type": "agent.speak_end",
"event_id": "<event_id>",
"audio": "<optional final base64 audio chunk>"
}
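A sketch of closing the speaking turn started by the agent.speak sketch above; attaching a trailing audio chunk is optional.

import base64
import json
import uuid

async def end_speech(ws, final_chunk: bytes | None = None) -> None:
    # Close the current speaking turn; the audio field is optional.
    message = {"type": "agent.speak_end", "event_id": str(uuid.uuid4())}
    if final_chunk:
        message["audio"] = base64.b64encode(final_chunk).decode("ascii")
    await ws.send(json.dumps(message))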
agent.audio_buffer_clear
Discards buffered audio (a combined barge-in sketch follows the agent.interrupt example below).
{
"type": "agent.audio_buffer_clear",
"event_id": "<event_id>"
}
agent.interrupt
Aborts the current speech task and any queued speech tasks.
{
"type": "agent.interrupt",
"event_id": "<event_id>"
}
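A barge-in sketch combining agent.audio_buffer_clear and agent.interrupt, for example when your ASR detects that the user has started speaking; the detection hook itself is outside this API and not shown.

import json
import uuid

async def handle_barge_in(ws) -> None:
    # Drop any audio that is buffered but not yet played, then cut off the
    # speech task that is currently running and anything queued behind it.
    for action in ("agent.audio_buffer_clear", "agent.interrupt"):
        await ws.send(json.dumps({"type": action, "event_id": str(uuid.uuid4())}))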
agent.start_listening
Triggers the avatar's listening animation (only while the avatar is idle).
{
"type": "agent.start_listening",
"event_id": "<event_id>"
}
agent.stop_listening
Stops the listening animation if it is active.
{
"type": "agent.stop_listening",
"event_id": "<event_id>"
}
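A small sketch wrapping the two listening actions, assuming you toggle them around your ASR's start- and end-of-turn events:

import json
import uuid

async def set_listening(ws, listening: bool) -> None:
    # agent.start_listening only takes effect while the avatar is idle;
    # agent.stop_listening is a no-op if the animation is not active.
    action = "agent.start_listening" if listening else "agent.stop_listening"
    await ws.send(json.dumps({"type": action, "event_id": str(uuid.uuid4())}))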
session.keep_alive
Resets the idle timeout to keep the session active.
{
"type": "session.keep_alive",
"event_id": "<event_id>"
}
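A keep-alive sketch that pings the session on a fixed interval; the 30-second interval is an assumption, not a documented timeout value.

import asyncio
import json
import uuid

async def keep_session_alive(ws, interval_s: float = 30.0) -> None:
    # Periodically reset the idle timeout so the session is not torn down
    # while the avatar has nothing to say.
    while True:
        await asyncio.sleep(interval_s)
        await ws.send(json.dumps({
            "type": "session.keep_alive",
            "event_id": str(uuid.uuid4()),
        }))

Run it as a background task, e.g. asyncio.create_task(keep_session_alive(ws)), alongside your sending and receiving loops.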
Server Events (JSON Messages over WebSocket)
- session.state_updated: Updates on session lifecycle states (initialized, connecting, connected, disconnecting).
- agent.audio_buffer_appended: Confirmation of buffered audio acceptance.
- agent.audio_buffer_committed: Buffered audio finalized for playback.
- agent.audio_buffer_cleared: Notification that the audio buffer was cleared.
- agent.idle_started / agent.idle_ended: Avatar idle state changes.
- agent.speak_started / agent.speak_ended: Avatar speech task lifecycle events.
- agent.speak_interrupted: Avatar speech interrupted early.
- error: Reports request failures with error type, message, and related client event ID.
- warning: Non-fatal notices such as deprecations.
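A receive-loop sketch that dispatches on the type field; the reactions shown are illustrative, and the field names on the error event (message, event_id) are assumptions based on the description above.

import json

async def listen_for_events(ws) -> None:
    # Read server events as they arrive and branch on their type.
    async for raw in ws:
        event = json.loads(raw)
        etype = event.get("type")
        if etype == "agent.speak_ended":
            print("avatar finished a speech task")
        elif etype == "agent.speak_interrupted":
            print("speech task was cut off early")
        elif etype == "error":
            # Assumed fields: message and the event_id of the failed client message.
            print("request failed:", event.get("message"), event.get("event_id"))
        else:
            print("server event:", etype)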
Architecture Overview
- Client: Runs WebRTC client SDK (e.g., LiveKit SDK, Pipecat RTVI).
- Backend: Hosts your application server with WebRTC server SDK.
- Agent Worker: Manages speech orchestration, feeding audio to KonPro API.
- KonPro API & Services
- WebRTC Provider: LiveKit, Daily, Agora, etc.
- ASR Provider: Automatic speech recognition (Deepgram, Gladia).
- LLM Provider: Large language models (OpenAI, Gemini).
- TTS Provider: Text-to-speech systems (ElevenLabs, Cartesia).
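A high-level sketch of one conversational turn inside the agent worker, reusing speak_pcm and end_speech from the sketches above; run_llm and run_tts are hypothetical stand-ins for your LLM and TTS providers, not part of this API.

async def agent_turn(ws, user_transcript: str) -> None:
    # The user's transcript has already arrived from your ASR provider.
    reply_text = await run_llm(user_transcript)   # hypothetical LLM helper
    pcm_bytes = await run_tts(reply_text)         # hypothetical TTS helper (16-bit, 24 kHz PCM)
    await speak_pcm(ws, pcm_bytes)                # stream audio to the avatar
    await end_speech(ws)                          # close the speaking turn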
Notes
Beta API: This API is in beta; backward compatibility and API contract stability are not guaranteed. Feedback is welcome as the API continues to evolve.