Prompt Caching

Week 3: Advanced Techniques in Prompt Engineering

Understanding Prompt Caching

Prompt caching allows you to store and reuse large amounts of context between API calls, significantly reducing costs and latency for applications that use consistent background information or instructions.

When a request reuses a cached prefix, the model skips reprocessing that content, so responses return faster and the cached portion of the input is billed at a reduced rate. This makes caching especially effective for repetitive tasks and for prompts that share a large, consistent prefix.

When to Use Prompt Caching

Consider using prompt caching when you need to:

  • Process large documents or datasets repeatedly
  • Maintain consistent instructions or context across multiple queries
  • Optimize costs for applications with frequent, similar requests
  • Improve response times for context-heavy applications

Prompt Caching Process

  1. Structure your prompt with static content at the beginning
  2. Mark the cacheable sections using the cache_control parameter
  3. Send requests with the cached content
  4. Benefit from reduced processing time and costs on subsequent calls

Prompt Caching Example: Warren Buffett Letter Analysis

Let's walk through a complete example of implementing prompt caching with Claude for analyzing Warren Buffett's letters to investors:

Dataset Information

We'll be using the "Warren Buffett Letters to Investors (1977 - 2021)" dataset, which contains a collection of letters written by Warren Buffett to Berkshire Hathaway shareholders. This dataset provides valuable insights into Buffett's investment philosophy, business strategies, and market observations. For this example, we'll use just two years of letters (1990 and 1991), which still comes to roughly 50,000 tokens, or more than 150,000 characters.

Dataset source: Kaggle - Warren Buffett Letters to Investors

Step-by-Step Implementation

  1. Load the Warren Buffett letters content
  2. Set up the system message with cached content
  3. Create a function to handle user queries
  4. Test the implementation with multiple queries

import anthropic

# Initialize the Anthropic client (reads ANTHROPIC_API_KEY from the environment)
client = anthropic.Anthropic()

print("Step 1: Load the Warren Buffett letters content")
with open("academy/warren_buffet_letters.txt", "r", encoding="utf-8") as file:
    letters_text = file.read()
print(f"Loaded letters with {len(letters_text)} characters")

print("\nStep 2: Set up the system message with cached content")
system_message = [
    {
        "type": "text",
        "text": "You are an AI assistant tasked with providing insights about Warren Buffett's investment philosophy and business strategies based on his letters to Berkshire Hathaway shareholders from 1977 to 2021. Your goal is to answer questions based on the content of these letters.\n",
    },
    {
        "type": "text",
        "text": letters_text,
        "cache_control": {"type": "ephemeral"}
    }
]

print("\nStep 3: Create a function to handle user queries")
def ask_question(question):
    print(f"Asking: '{question}'")
    # Prompt caching is generally available in the Messages API, so the
    # cache_control markers in the system blocks above are applied directly.
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        system=system_message,
        messages=[{"role": "user", "content": question}],
    )
    # Extract the text content from the TextBlock
    response_text = response.content[0].text
    return response_text

print("\nStep 4: Test the implementation with multiple queries")
questions = [
    "What are Warren Buffett's key principles for value investing?",
    "How has Buffett's approach to acquisitions evolved over the years?",
    "What are Buffett's views on market volatility and economic cycles?",
    "How does Buffett evaluate the management of companies he invests in?"
]

for question in questions:
    print(f"\nQuestion: {question}")
    answer = ask_question(question)
    print(f"Answer: {answer}")

print("\nPrompt caching example completed")

Key Concepts Explained

  • Cache Control: Marks specific parts of your prompt for caching, allowing reuse in subsequent calls.
  • Ephemeral Caching: Cached content has a 5-minute lifetime, refreshed with each use.
  • Prefix Caching: The entire prompt up to and including the cache_control block is cached.
  • Cache Hits: Subsequent calls within the cache lifetime benefit from reduced processing and costs.
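
To verify that a cache hit actually occurred, inspect the usage metadata on each response. The sketch below assumes the client and system_message defined in the example above; the cache_creation_input_tokens and cache_read_input_tokens fields report how many input tokens were written to or read from the cache.

# Sketch: confirm caching behavior via the usage metadata on a response.
# Assumes `client` and `system_message` from the Warren Buffett example above.
response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=256,
    system=system_message,
    messages=[{"role": "user", "content": "Summarize the 1990 letter in one paragraph."}],
)

usage = response.usage
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")  # > 0 when the prefix is first cached
print(f"Cache read tokens:  {usage.cache_read_input_tokens}")      # > 0 on subsequent cache hits
print(f"Uncached input tokens: {usage.input_tokens}")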

Best Practices for Anthropic Prompt Caching

  1. Strategic Content Placement: Place static, reusable content at the beginning of your prompt for optimal caching.
  2. Minimum Cache Size: Ensure cached content meets the minimum token requirements (1024 for Sonnet/Opus, 2048 for Haiku).
  3. Consistent Caching: Keep cached sections identical across calls for effective hits.
  4. Cache Breakpoints: Use up to 4 cache breakpoints to separate different reusable sections (see the sketch after this list).
  5. Performance Monitoring: Track cache hit rates and adjust your strategy as needed.
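
Each cache_control marker caches the entire prefix up to and including that block, so stable instructions and a large reference document can be cached and reused independently. The following sketch is illustrative only: core_instructions and letters_text stand in for your own content.

# Sketch: two cache breakpoints in one system message (placeholder variables).
system_message = [
    {
        "type": "text",
        "text": core_instructions,               # short, rarely changing instructions
        "cache_control": {"type": "ephemeral"},  # breakpoint 1: caches the prefix up to here
    },
    {
        "type": "text",
        "text": letters_text,                    # large reference document
        "cache_control": {"type": "ephemeral"},  # breakpoint 2: caches the full prefix including the document
    },
]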

Advanced Prompt Caching Techniques

As you become more comfortable with prompt caching, consider exploring these advanced techniques:

  • Caching tool definitions for consistent function calling setups (sketched below)
  • Implementing caching for multi-turn conversations
  • Combining prompt caching with other optimization strategies
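
Tool definitions can also carry a cache_control marker; placing it on the last tool caches the whole tool list as part of the prompt prefix. The sketch below is illustrative: the get_stock_price tool is invented for this example, and in practice the tool definitions must still meet the minimum cacheable token count before a cache entry is created.

# Sketch: caching tool definitions (the tool shown here is a hypothetical example).
tools = [
    {
        "name": "get_stock_price",
        "description": "Look up the latest closing price for a given ticker symbol.",
        "input_schema": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
        "cache_control": {"type": "ephemeral"},  # caches all tool definitions up to and including this one
    },
]

response = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What is Berkshire Hathaway's ticker symbol?"}],
)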

Summary

Prompt caching enables efficient reuse of large contexts and consistent instructions in your Anthropic API calls. By strategically structuring your prompts and utilizing cache control, you can create AI applications that handle complex tasks with improved performance and reduced costs.