
Anthropic introduces prompt caching to reduce latency and costs

Anthropic has introduced a new feature for some of its Claude models that allows developers to cut down on prompt costs and latency.

Prompt caching allows users to cache frequently used context so that it can be reused in future API calls. According to the company, equipping the model with cached background knowledge and example outputs can reduce costs by up to 90% and latency by up to 85% for long prompts.
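
As a rough illustration of how this looks against the Messages API, the sketch below (using the anthropic Python SDK) marks a large system block with the cache_control parameter and sends the beta header Anthropic documented at launch; the model ID, file name, and question are placeholders, so treat it as a sketch rather than official sample code.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder long-form context to be cached (e.g. a report, book, or codebase summary).
long_report_text = open("report.txt").read()

# The large, reusable block is tagged with cache_control so Claude can reuse it
# on later calls instead of reprocessing it each time.
cached_system_blocks = [
    {
        "type": "text",
        "text": "You are an assistant that answers questions about the attached report.",
    },
    {
        "type": "text",
        "text": long_report_text,
        "cache_control": {"type": "ephemeral"},  # cache everything up to and including this block
    },
]

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    # Beta header required while prompt caching is in public beta.
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=cached_system_blocks,
    messages=[{"role": "user", "content": "Summarize the key findings."}],
)
print(response.content[0].text)
```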

Prompt caching lends itself to several use cases, including keeping a summarized version of a codebase on hand for coding assistants, including long-form documents in prompts, and providing detailed instruction sets with several examples of desired outputs.

Developers could also use it to essentially converse with long-form content like books, papers, documentation, and podcast transcripts. According to Anthropic's testing, chatting with a book when 100,000 of its tokens are cached takes 2.4 seconds, compared to 11.5 seconds without caching, a 79% reduction in latency.
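
Continuing the sketch above, a follow-up question re-sends the same cached system blocks; assuming the cache (which had a five-minute lifetime during the beta) is still warm, the large block is read back rather than reprocessed, which is where the latency savings come from. The usage field names below are those reported during the beta and are accessed defensively in case they are absent.

```python
# Follow-up question that re-sends the identical cached system blocks.
followup = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=cached_system_blocks,  # identical blocks from the first call hit the cache
    messages=[{"role": "user", "content": "What does the report conclude?"}],
)

# During the beta, responses report how many tokens were read from the cache.
print("cached tokens read:", getattr(followup.usage, "cache_read_input_tokens", None))
print("regular input tokens:", followup.usage.input_tokens)
```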

Writing an input token to the cache costs 25% more than the base input token price, but reading that cached content back costs only 10% of the base input token price. Actual prices vary based on the specific model.
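
As a back-of-the-envelope illustration, assuming Claude 3.5 Sonnet's published base input price of $3 per million tokens at the time, a 100,000-token cached prompt works out roughly as follows (a sketch, not official pricing guidance):

```python
# Rough cost sketch for a 100,000-token prompt, assuming a base input price
# of $3 per million tokens (Claude 3.5 Sonnet's published rate at the time).
BASE_INPUT_PER_MTOK = 3.00
CACHE_WRITE_PER_MTOK = BASE_INPUT_PER_MTOK * 1.25  # 25% premium to write to the cache
CACHE_READ_PER_MTOK = BASE_INPUT_PER_MTOK * 0.10   # cached reads cost 10% of the base price

prompt_tokens = 100_000
write_cost = prompt_tokens / 1_000_000 * CACHE_WRITE_PER_MTOK    # ~$0.375, paid once
reuse_cost = prompt_tokens / 1_000_000 * CACHE_READ_PER_MTOK     # ~$0.03 per cached call
uncached_cost = prompt_tokens / 1_000_000 * BASE_INPUT_PER_MTOK  # ~$0.30 per call without caching

print(f"cache write: ${write_cost:.3f}  cached reuse: ${reuse_cost:.3f}  uncached: ${uncached_cost:.3f}")
```

On these assumed figures, the one-time write premium is already recovered by the second call that reuses the cache.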

Prompt caching is now available in public beta for Claude 3.5 Sonnet and Claude 3 Haiku, with support for Claude 3 Opus coming soon.

