Prompt Caching
Week 3: Advanced Techniques in Prompt Engineering
Understanding Prompt Caching
Prompt caching allows you to store and reuse large amounts of context between API calls, significantly reducing costs and latency for applications that use consistent background information or instructions. It is especially valuable for repetitive tasks and for prompts that share large, unchanging elements.
When to Use Prompt Caching
Consider using prompt caching when you need to:
- Process large documents or datasets repeatedly
- Maintain consistent instructions or context across multiple queries
- Optimize costs for applications with frequent, similar requests
- Improve response times for context-heavy applications
Prompt Caching Process
- Structure your prompt with static content at the beginning
- Mark the cacheable sections using the cache_control parameter (a minimal sketch follows this list)
- Send requests with the cached content
- Benefit from reduced processing time and costs on subsequent calls
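A minimal sketch of that structure with the Anthropic Python SDK is shown below. The instructions and document text are placeholders, and the request uses the same beta prompt-caching endpoint as the full example later in this section; the key detail is the cache_control marker on the large, static block at the end of the system prompt.

import anthropic

client = anthropic.Anthropic()

# Placeholder static context; in practice it must exceed the model's minimum cacheable size
large_document = "... large, unchanging reference text ..."

response = client.beta.prompt_caching.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=512,
    system=[
        {"type": "text", "text": "Answer questions using only the provided document."},
        # Everything up to and including this block becomes the cached prefix
        {"type": "text", "text": large_document, "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": "What is the main topic of the document?"}],
)
print(response.content[0].text)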
Prompt Caching Example: Warren Buffett Letter Analysis
Let's walk through a complete example of implementing prompt caching with Claude for analyzing Warren Buffett's letters to investors:
Dataset Information
We'll be using the "Warren Buffett Letters to Investors (1977 - 2021)" dataset, which contains a collection of letters written by Warren Buffett to Berkshire Hathaway shareholders. This dataset provides valuable insights into Buffett's investment philosophy, business strategies, and market observations. For this example, we'll use just two years of letters, 1990 and 1991, which still amounts to roughly 50,000 tokens (more than 150,000 characters).
Dataset source: Kaggle - Warren Buffett Letters to Investors
Step-by-Step Implementation
- Load the Warren Buffett letters content
- Set up the system message with cached content
- Create a function to handle user queries
- Test the implementation with multiple queries
import anthropic

# Initialize the Anthropic client (reads ANTHROPIC_API_KEY from the environment)
client = anthropic.Anthropic()

print("Step 1: Load the Warren Buffett letters content")
with open("academy/warren_buffet_letters.txt", "r", encoding="utf-8") as file:
    letters_text = file.read()
print(f"Loaded letters with {len(letters_text)} characters")

print("\nStep 2: Set up the system message with cached content")
system_message = [
    {
        "type": "text",
        "text": "You are an AI assistant tasked with providing insights about Warren Buffett's investment philosophy and business strategies based on his letters to Berkshire Hathaway shareholders from 1977 to 2021. Your goal is to answer questions based on the content of these letters.\n",
    },
    {
        "type": "text",
        "text": letters_text,
        # Mark the letters for caching; the full prefix up to this block is cached
        "cache_control": {"type": "ephemeral"},
    },
]

print("\nStep 3: Create a function to handle user queries")
def ask_question(question):
    print(f"Asking: '{question}'")
    response = client.beta.prompt_caching.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        system=system_message,
        messages=[{"role": "user", "content": question}],
    )
    # Extract the text content from the first TextBlock in the response
    return response.content[0].text

print("\nStep 4: Test the implementation with multiple queries")
questions = [
    "What are Warren Buffett's key principles for value investing?",
    "How has Buffett's approach to acquisitions evolved over the years?",
    "What are Buffett's views on market volatility and economic cycles?",
    "How does Buffett evaluate the management of companies he invests in?",
]
for question in questions:
    print(f"\nQuestion: {question}")
    answer = ask_question(question)
    print(f"Answer: {answer}")

print("\nPrompt caching example completed")
Key Concepts Explained
- Cache Control: Marks specific parts of your prompt for caching, allowing reuse in subsequent calls.
- Ephemeral Caching: Cached content has a 5-minute lifetime, refreshed with each use.
- Prefix Caching: The entire prompt up to and including the cache_control block is cached.
- Cache Hits: Subsequent calls within the cache lifetime benefit from reduced processing and costs.
Best Practices for Anthropic Prompt Caching
- Strategic Content Placement: Place static, reusable content at the beginning of your prompt for optimal caching.
- Minimum Cache Size: Ensure cached content meets the minimum token requirements (1024 for Sonnet/Opus, 2048 for Haiku).
- Consistent Caching: Keep cached sections identical across calls for effective hits.
- Cache Breakpoints: Use up to 4 cache breakpoints to separate different reusable sections (see the sketch after this list).
- Performance Monitoring: Track cache hit rates and adjust your strategy as needed.
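To illustrate cache breakpoints, the sketch below marks two static sections, stable instructions and a large reference document, with their own cache_control entries so each can be reused independently. The variable contents are placeholders.

# Hypothetical static sections; each must exceed the model's minimum cacheable size to be cached
long_instructions = "... long, stable analyst instructions ..."
reference_document = "... large reference document ..."

system_message = [
    {
        # Breakpoint 1: stable instructions, cached on their own
        "type": "text",
        "text": long_instructions,
        "cache_control": {"type": "ephemeral"},
    },
    {
        # Breakpoint 2: the reference document, cached separately from the instructions
        "type": "text",
        "text": reference_document,
        "cache_control": {"type": "ephemeral"},
    },
]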
Advanced Prompt Caching Techniques
As you become more comfortable with prompt caching, consider exploring these advanced techniques:
- Caching tool definitions for consistent function-calling setups (sketched after this list)
- Implementing caching for multi-turn conversations
- Combining prompt caching with other optimization strategies
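As a sketch of the first idea, cache_control can also be attached to the last tool definition so that a large, stable set of tool schemas is reused across calls. The tool below is made up for illustration, and it assumes the beta prompt-caching endpoint accepts tool definitions the same way the regular Messages API does; in practice the combined tool definitions must exceed the minimum cacheable size for the cache to apply.

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_stock_price",  # hypothetical tool, for illustration only
        "description": "Look up the latest closing price for a given ticker symbol.",
        "input_schema": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "Stock ticker, e.g. BRK.B"}
            },
            "required": ["ticker"],
        },
        # Marking the last tool caches the full tool block as part of the prompt prefix
        "cache_control": {"type": "ephemeral"},
    }
]

response = client.beta.prompt_caching.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=512,
    tools=tools,
    messages=[{"role": "user", "content": "What did Berkshire's class B shares close at today?"}],
)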
Summary
Prompt caching enables efficient reuse of large contexts and consistent instructions in your Anthropic API calls. By strategically structuring your prompts and utilizing cache control, you can create AI applications that handle complex tasks with improved performance and reduced costs.