Realtime API supports multi-model text and speech experiences including natural speech-to-speech conversations using preset voices already supported in the API. Credit: Tero Vesalainen/Shutterstock OpenAI has introduced a public beta of the Realtime API, an API that allows paid developers to build low-latency, multi-modal experiences including text and speech in apps. Introduced October 1, the Realtime API, similar to the OpenAI ChatGPT Advanced Voice Mode, supports natural speech-to-speech conversations using preset voices that the API already supports. OpenAI also is introducing audio input and output in the Chat Completions API to support use cases that do not need the low-latency benefits of the Realtime API. Developers can pass text or audio inputs into GPT-4o and have the model respond with text, audio, or both. With the Realtime API and the audio support in the Chat Completions API, developers do not have to link together multiple models to power voice experiences. They can build natural conversational experiences with just one API call, OpenAI said. Previously, creating a similar voice experience had developers transcribing an automatic speech recognition model such as Whisper, passing text to a text model for inference or reasoning, and playing the model’s output using a text-to-speech model. This approach often resulted in loss of emotion, emphasis, and accents, plus latency. With the Chat Completions API, developers can deal with the entire process with one API call, though it remains slower than human conversation. The Realtime API improves latency by streaming audio inputs and outputs directly, enabling more natural conversational experiences, OpenAI said. The Realtime API also can handle interruptions automatically, like ChatGPT’s advanced voice mode. The Realtime API enables development of a persistent WebSocket connection to exchange messages with GPT-4o. The API backs function calling, which makes it possible for voice assistants to respond to user requests by pulling in new context or triggering actions. Also, the Realtime API leverages multiple layers of safety protections to mitigate the risk of API abuse, including automated monitoring and human review of flagged model inputs and outputs. The Realtime API uses text tokens and audio tokens. Text input costs $5 per 1M tokens and text output costs $20 per 1M tokens. Audio input costs $100 per 1M tokens and audio output costs $200 per 1M tokens. OpenAI said plans for improving the Realtime API include adding support for vision and video, increasing rate limits, adding support for prompt caching, and expanding model support to GPT-4o mini. The company said it would also integrate support for the Realtime API into the OpenAI Python and Node.js SDKs. Related content feature 14 great preprocessors for developers who love to code Sometimes it seems like the rules of programming are designed to make coding a chore. Here are 14 ways preprocessors can help make software development fun again. By Peter Wayner Nov 18, 2024 10 mins Development Tools Software Development news JetBrains IDEs ease debugging for Kubernetes apps Version 2024.3 updates to IntelliJ, PyCharm, WebStorm, and other JetBrains IDEs streamline remote debugging of Kubernetes microservices and much more. By Paul Krill Nov 14, 2024 3 mins Integrated Development Environments Java Python analysis Understanding Hyperlight, Microsoft’s minimal VM manager Microsoft is making its Rust-based, functions-focused VM tool available on Azure at last, ready to help event-driven applications at scale. By Simon Bisson Nov 14, 2024 8 mins Microsoft Azure Rust Serverless Computing analysis GitHub Copilot learns new tricks GitHub and Microsoft have taken their AI-powered programming assistant into new territories, tackling code reviews, simple web apps, Java upgrades, and Azure help and troubleshooting. By Simon Bisson Nov 07, 2024 8 mins GitHub Java Microsoft Azure Resources Videos