Using cached prompts can save up to 90% on API input costs, the company said.

Anthropic announced Wednesday that it is introducing prompt caching in the application programming interface (API) for its Claude family of generative AI models, allowing developers to save frequently used prompts between API calls. Prompt caching lets customers provide Claude with long prompts that can then be referred to in subsequent requests without having to send the prompt again.

“With prompt caching, customers can provide Claude with more background knowledge and example outputs—all while reducing costs by up to 90% and latency by up to 85% for long prompts,” the company said in its announcement. The feature is now available in public beta for Claude 3.5 Sonnet and Claude 3 Haiku, with support for Claude 3 Opus, its largest model, coming “soon.”

A 2023 paper from researchers at Yale University and Google explained that, by saving prompts on the inference server, developers can “significantly reduce latency in time-to-first-token, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.”

“It is becoming expensive to use closed-source LLMs when the usage goes high,” noted Andy Thurai, VP and principal analyst at Constellation Research. “Many enterprises and developers are facing sticker shock, especially if they have to repeatably use the same prompts to get the same/similar responses from the LLMs, they still charge the same amount for every round trip. This is especially true when multiple users enter the same (or somewhat similar prompt) looking for similar answers many times a day.”

Use cases for prompt caching

Anthropic cited several use cases where prompt caching can be helpful, including conversational agents, coding assistants, processing of large documents, and letting users query cached long-form content such as books, papers, or transcripts. It could also be used to share instructions, procedures, and examples to fine-tune Claude’s responses, or to improve performance when multiple rounds of tool calls and iterative changes require multiple API calls.

According to the documentation, when prompt caching is enabled, the system checks whether each prompt it receives has been previously cached. If so, it uses the cached version; if not, it caches the prompt for later use. Developers can define up to four cache breakpoints in a prompt, which are cached at 1,024-token boundaries in Claude 3.5 Sonnet (and in Claude 3 Opus, once the feature is implemented) and 2,048-token boundaries in Claude 3 Haiku; shorter prompts cannot currently be cached. The cache lifetime is five minutes, refreshed every time the cached content is used.

The new feature comes with a new pricing structure: cache write tokens are 25% more expensive than base input tokens, while cache read tokens are 90% cheaper. “Early customers have seen substantial speed and cost improvements with prompt caching for a variety of use cases—from including a full knowledge base to 100-shot examples to including each turn of a conversation in their prompt,” the company said.
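To illustrate how a developer would use the feature, here is a minimal sketch of a prompt-caching request with Anthropic's Python SDK. It is based on the public beta described above; the model string, beta header, file name, and usage fields shown are assumptions drawn from the beta documentation and may change. The cost figures in the closing comments simply apply the multipliers quoted in the article.

```python
import anthropic

# Minimal sketch of a prompt-caching request, assuming the public-beta header
# "prompt-caching-2024-07-31" and the Claude 3.5 Sonnet model string below.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_document = open("transcript.txt").read()  # long-form content to cache (hypothetical file)

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "Answer questions about the attached transcript."},
        {
            "type": "text",
            "text": long_document,
            # Marks a cache breakpoint: everything up to here is written to the
            # cache on the first call and read from the cache on later calls
            # within the five-minute cache lifetime.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Summarize the key decisions made."}],
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)

# The usage block reports cache activity, e.g. cache_creation_input_tokens on
# the first call and cache_read_input_tokens on subsequent ones.
print(response.usage)

# Back-of-the-envelope cost illustration using the multipliers in the article
# (cache writes cost 1.25x base input tokens, cache reads 0.10x): reusing a
# 100,000-token context 10 times within the cache window costs roughly
# 100_000 * 1.25 + 9 * 100_000 * 0.10 = 215,000 token-equivalents,
# versus 1,000,000 without caching -- about a 78% reduction.
```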
Raises security concerns

However, there are concerns, noted Thomas Randall, director of AI market research at Info-Tech Research Group. “While prompt caching is the right direction for performance optimization and greater usage efficiency, it is important to highlight security best practices when utilizing caching within programming,” Randall said. “If prompts are shared across (or between) organizations that are not reset or reviewed appropriately, sensitive information within a cache may inadvertently be passed on.”

Thurai pointed out that, while Anthropic is presenting prompt caching as new, a few other LLM vendors are still experimenting with the option. He said that some open-source solutions on the market, such as GPTCache and Redis, store results as embeddings and retrieve them first if they match the prompt, without even visiting the LLM.

“Regardless of which option is used, this can offer huge savings if similar prompts are sent to an LLM many times,” Thurai said. “I would expect other closed-source LLM providers to announce similar features soon as well.”
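For context on the embedding-based approach Thurai describes, below is a minimal, generic sketch of client-side semantic caching. It does not use GPTCache's or Redis's actual APIs; the embed_fn callback, the similarity threshold, and the call_llm parameter are all assumptions made for illustration.

```python
import numpy as np


class SemanticPromptCache:
    """Illustrative embedding-based prompt cache: return a stored response when
    a new prompt is close enough to one seen before, otherwise call the LLM."""

    def __init__(self, embed_fn, threshold: float = 0.9):
        self.embed_fn = embed_fn    # any sentence-embedding function (assumed)
        self.threshold = threshold  # cosine-similarity cutoff for a cache hit
        self.entries: list[tuple[np.ndarray, str]] = []  # (embedding, response)

    def lookup(self, prompt: str) -> str | None:
        query = self.embed_fn(prompt)
        for vec, response in self.entries:
            sim = float(np.dot(query, vec) /
                        (np.linalg.norm(query) * np.linalg.norm(vec)))
            if sim >= self.threshold:
                return response     # cache hit: skip the LLM call entirely
        return None

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed_fn(prompt), response))


def answer(prompt: str, cache: SemanticPromptCache, call_llm) -> str:
    """Serve from the cache when possible; only pay for the API on a miss."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached
    response = call_llm(prompt)
    cache.store(prompt, response)
    return response
```

Note the design difference the article points to: Anthropic's prompt caching still runs the model on every request and caches the processed prompt on the inference side, whereas this client-side pattern returns a previously stored answer and avoids the LLM call altogether when a similar prompt recurs.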