OpenAI details Realtime voice, translation, and transcription sessions
OpenAI has published API documentation for its 'Realtime and audio' capabilities, covering low-latency voice agents, live translation, and transcription. The documentation describes the available session types (voice agent, translation, transcription), connection methods such as WebRTC, WebSocket, and SIP, and models including gpt-realtime-2, gpt-realtime-translate, and gpt-realtime-whisper. It also gives guidance on safety identifiers and on migrating from the beta interface to the GA interface.
Key Takeaways
- gpt-realtime-2 is the model OpenAI lists for low-latency voice-agent sessions on `/v1/realtime`.
- gpt-realtime-translate is tied to continuous translation sessions on `/v1/realtime/translations`.
- gpt-realtime-whisper powers realtime transcription with controllable latency and transcript deltas.
- OpenAI says browser and mobile clients should use WebRTC, while server media pipelines can use WebSocket.
- GA migration includes removing `OpenAI-Beta: realtime=v1` and using `POST /v1/realtime/client_secrets` for ephemeral credentials.
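The pairings in this list can be captured in a small lookup table. The Python sketch below is illustrative only: `SessionProfile`, `PROFILES`, `pick_transport`, and `build_client_secret_request` are hypothetical names, the base URL is assumed, and the JSON body sent to `/v1/realtime/client_secrets` is an assumed, simplified shape that should be checked against the official reference. The summary does not name an endpoint for transcription sessions, so that field is left empty. The code only builds a request description and never contacts the network.

```python
from dataclasses import dataclass
from typing import Optional

API_BASE = "https://api.openai.com"  # assumed base URL


@dataclass(frozen=True)
class SessionProfile:
    model: str
    endpoint: Optional[str]  # None where the summary names no endpoint


# Model and endpoint per session type, as listed in the takeaways.
PROFILES = {
    "voice_agent": SessionProfile("gpt-realtime-2", "/v1/realtime"),
    "translation": SessionProfile("gpt-realtime-translate", "/v1/realtime/translations"),
    "transcription": SessionProfile("gpt-realtime-whisper", None),
}

# Transport per client kind: WebRTC for browser and mobile clients,
# WebSocket for server media pipelines, SIP for telephony.
TRANSPORTS = {
    "browser": "webrtc",
    "mobile": "webrtc",
    "server": "websocket",
    "telephony": "sip",
}


def pick_transport(client_kind: str) -> str:
    """Return the transport suggested for a given kind of client."""
    try:
        return TRANSPORTS[client_kind]
    except KeyError:
        raise ValueError(f"unknown client kind: {client_kind!r}") from None


def build_client_secret_request(api_key: str, session_kind: str) -> dict:
    """Describe a GA-style request for an ephemeral credential.

    The returned dict is a plain description (method, url, headers, json);
    the body shape is an assumption, not a confirmed schema.
    """
    profile = PROFILES[session_kind]
    return {
        "method": "POST",
        "url": f"{API_BASE}/v1/realtime/client_secrets",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            # GA migration: the "OpenAI-Beta: realtime=v1" header is omitted.
        },
        "json": {"session": {"model": profile.model}},  # assumed shape
    }
```

A server would typically send this request with its own HTTP client and hand only the returned ephemeral credential to the browser or mobile app, so the long-lived API key never leaves the backend.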
Why It Matters
OpenAI has turned realtime audio into a documented production path rather than a loose beta pattern, with separate flows for voice agents, translation, and transcription. That matters for streaming and media teams because the API now spells out which model, endpoint, and transport fit each audio job: WebRTC for browsers, WebSocket for server pipelines, and SIP for telephony. The clearest signal to watch is whether teams adopt the GA interface changes, especially `/v1/realtime/client_secrets`, `/v1/realtime/calls`, and the newer event names such as `response.output_text.delta` and `response.output_audio.delta`.
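For teams auditing existing event handlers, one way to surface the renamed events is a small translation table. The sketch below is a hypothetical helper, not part of any OpenAI SDK: the two GA names come from the documentation summarized here, while the beta-era names on the left (`response.text.delta`, `response.audio.delta`) are an assumption about what older clients handle and should be verified against the migration guide.

```python
# Assumed beta-era names mapped to the GA names cited in the summary.
LEGACY_TO_GA = {
    "response.text.delta": "response.output_text.delta",
    "response.audio.delta": "response.output_audio.delta",
}


def to_ga_event_type(event_type: str) -> str:
    """Return the GA name for a known legacy event type, else the input unchanged."""
    return LEGACY_TO_GA.get(event_type, event_type)


def normalize_event(event: dict) -> dict:
    """Return a shallow copy of an event dict with its 'type' mapped to the GA name."""
    normalized = dict(event)
    if "type" in normalized:
        normalized["type"] = to_ga_event_type(normalized["type"])
    return normalized
```

Routing every incoming event through `normalize_event` lets a single set of handlers keyed on GA names serve both old and new clients during a migration window.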
Read full article at platform.openai.com
