Home Artificial Intelligence OpenAI previews Realtime API for speech-to-speech apps

by Paul Krill

Editor at Large

OpenAI previews Realtime API for speech-to-speech apps

news

Oct 02, 20243 mins

APIsArtificial IntelligenceDevelopment Tools

Realtime API supports multi-model text and speech experiences including natural speech-to-speech conversations using preset voices already supported in the API.

Image of a person typing on a keyboard. Text, speech, typing.

Credit: Tero Vesalainen/Shutterstock

OpenAI has introduced a public beta of the Realtime API, an API that allows paid developers to build low-latency, multi-modal experiences including text and speech in apps.

Introduced October 1, the Realtime API, similar to the OpenAI ChatGPT Advanced Voice Mode, supports natural speech-to-speech conversations using preset voices that the API already supports. OpenAI also is introducing audio input and output in the Chat Completions API to support use cases that do not need the low-latency benefits of the Realtime API. Developers can pass text or audio inputs into GPT-4o and have the model respond with text, audio, or both.

With the Realtime API and the audio support in the Chat Completions API, developers do not have to link together multiple models to power voice experiences. They can build natural conversational experiences with just one API call, OpenAI said. Previously, creating a similar voice experience had developers transcribing an automatic speech recognition model such as Whisper, passing text to a text model for inference or reasoning, and playing the model’s output using a text-to-speech model. This approach often resulted in loss of emotion, emphasis, and accents, plus latency.

With the Chat Completions API, developers can deal with the entire process with one API call, though it remains slower than human conversation. The Realtime API improves latency by streaming audio inputs and outputs directly, enabling more natural conversational experiences, OpenAI said. The Realtime API also can handle interruptions automatically, like ChatGPT’s advanced voice mode.

The Realtime API enables development of a persistent WebSocket connection to exchange messages with GPT-4o. The API backs function calling, which makes it possible for voice assistants to respond to user requests by pulling in new context or triggering actions. Also, the Realtime API leverages multiple layers of safety protections to mitigate the risk of API abuse, including automated monitoring and human review of flagged model inputs and outputs.

The Realtime API uses text tokens and audio tokens. Text input costs $5 per 1M tokens and text output costs $20 per 1M tokens. Audio input costs $100 per 1M tokens and audio output costs $200 per 1M tokens.

OpenAI said plans for improving the Realtime API include adding support for vision and video, increasing rate limits, adding support for prompt caching, and expanding model support to GPT-4o mini. The company said it would also integrate support for the Realtime API into the OpenAI Python and Node.js SDKs.

by Paul Krill

Editor at Large

Paul Krill is an editor at large at InfoWorld, focusing on coverage of application development (desktop and mobile) and core web technologies such as Java.

Topics

About

Policies

Our Network

More

OpenAI previews Realtime API for speech-to-speech apps

Realtime API supports multi-model text and speech experiences including natural speech-to-speech conversations using preset voices already supported in the API.

More from this author

Spin 3.0 supports polyglot development using Wasm components

Go language evolving for future hardware, AI workloads

JDK 24: The new features in Java 24

Rust Foundation moves forward on C++ and Rust interoperability

TypeScript 5.7 improves error reporting

Visual Studio 17.12 brings C++, Copilot enhancements

Microsoft’s .NET 9 arrives, with performance, cloud, and AI boosts

Red Hat OpenShift AI unveils model registry, data drift detection

Show me more

The dirty little secret of open source contributions

Designing the APIs that accidentally power businesses

Strategies to navigate the pitfalls of cloud costs

Building Python wheels to distribute your programs

Creating a pip install-able Python package

How to get better web requests in Python with httpx

OpenAI previews Realtime API for speech-to-speech apps

Realtime API supports multi-model text and speech experiences including natural speech-to-speech conversations using preset voices already supported in the API.

Related content

14 great preprocessors for developers who love to code

JetBrains IDEs ease debugging for Kubernetes apps

Understanding Hyperlight, Microsoft’s minimal VM manager

GitHub Copilot learns new tricks

More from this author

Spin 3.0 supports polyglot development using Wasm components

Go language evolving for future hardware, AI workloads

JDK 24: The new features in Java 24

Rust Foundation moves forward on C++ and Rust interoperability

TypeScript 5.7 improves error reporting

Visual Studio 17.12 brings C++, Copilot enhancements

Microsoft’s .NET 9 arrives, with performance, cloud, and AI boosts

Red Hat OpenShift AI unveils model registry, data drift detection

Show me more

The dirty little secret of open source contributions

Designing the APIs that accidentally power businesses

Strategies to navigate the pitfalls of cloud costs

Building Python wheels to distribute your programs

Creating a pip install-able Python package

How to get better web requests in Python with httpx