2024-04-15 5 min read

Cut LLM Costs 60% with Prompt Caching Strategies

Prompt caching eliminates redundant API charges by reusing identical context. Learn the strategies that cut production costs by 60% with minimal code changes.

Your LLM bill keeps climbing. Every API call processes the same system prompts, knowledge bases, and context windows. You're paying full price for computation you've already done. Prompt caching fixes this.

Major LLM providers now support prompt caching, including Anthropic (Claude), OpenAI (GPT-4o and later), and Google (Gemini). The mechanics are simple: cache immutable context, pay a fraction of the price on cache hits, and regenerate only when the underlying information changes. In production, this translates to roughly a 60% reduction in token costs for teams processing high-volume requests against stable data.

Here's how to implement it.

Understanding Cache Economics

When you cache 10,000 tokens of context, you pay the full input cost once (plus a small cache-write premium on most providers). Every subsequent request reads those cached tokens at roughly a 90% discount. If your system prompt plus knowledge base totals 5,000 tokens and you process 1,000 requests daily, the basic math shows the savings:

  • Without caching: 5,000 tokens × 1,000 requests = 5 million tokens/day
  • With caching: 5,000 tokens (cached once) + 500 tokens of new input × 1,000 requests ≈ 505,000 tokens/day

That's roughly a 90% reduction on static content.
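The billed-token math can be sketched in a few lines. The multipliers below are assumptions based on Anthropic's published pricing (cache writes bill at 1.25× the base input price, cache reads at 0.1×); check your provider's pricing page before relying on them:

```python
# Cache economics sketch. Assumed Anthropic-style multipliers:
# cache writes bill at 1.25x base input price, cache reads at 0.1x.
CACHE_WRITE_MULT = 1.25
CACHE_READ_MULT = 0.10

def effective_input_tokens(static, dynamic, requests):
    """Billed-token equivalents per day with the static context cached."""
    write = static * CACHE_WRITE_MULT                  # first request creates the cache
    reads = static * CACHE_READ_MULT * (requests - 1)  # every later request reads it
    fresh = dynamic * requests                         # per-request input, full price
    return write + reads + fresh

uncached = (5_000 + 500) * 1_000                   # 5,500,000 billed tokens/day
cached = effective_input_tokens(5_000, 500, 1_000)  # ~1,005,750 billed tokens/day
```

The static context alone drops by about 90%, but the dynamic input still bills at full price, which is why overall savings land well below the headline discount.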

Real-world results vary. Teams at LavaPi working with large document processing saw 60% overall savings because dynamic user queries still cost full price. The key is identifying which content is genuinely static.

Identify Cacheable Content

Not everything should be cached. Focus on:

  • System instructions (unchanged across requests)
  • Knowledge bases and reference documents
  • Few-shot examples
  • Conversation history in multi-turn workflows

Anything that changes per-user or per-request should stay outside the cache.

Example: Customer Support Bot

Your support system uses:

  • 2,000-token system prompt
  • 3,000-token product documentation
  • 1,000-token FAQ section
  • 500 dynamic tokens per customer query

Cache the first three. Pay full price only on the 500 dynamic tokens per query, and on the cached blocks again only when the documentation or FAQ is updated.
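That split can be sketched as follows, assuming Anthropic's Messages API block format (the placeholder strings stand in for your real content). A single `cache_control` breakpoint on the last static block caches the entire prefix above it:

```python
# Placeholder content; substitute your real prompt, docs, and FAQ.
SYSTEM_PROMPT = "You are a support agent for an e-commerce platform."  # ~2,000 tokens, static
PRODUCT_DOCS = "[PRODUCT_KNOWLEDGE_BASE] ..."                          # ~3,000 tokens, static
FAQ = "[FAQ] ..."                                                      # ~1,000 tokens, static

# One breakpoint on the final static block caches everything above it;
# the customer query goes in the messages array and stays uncached.
system_blocks = [
    {"type": "text", "text": SYSTEM_PROMPT},
    {"type": "text", "text": PRODUCT_DOCS},
    {"type": "text", "text": FAQ, "cache_control": {"type": "ephemeral"}},
]
```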

Implementation Patterns

Claude with Prompt Caching

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const response = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: 'You are a support agent for an e-commerce platform.',
    },
    {
      type: 'text',
      text: `[PRODUCT_KNOWLEDGE_BASE]\n${productDocs}`,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    {
      role: 'user',
      content: userQuery,
    },
  ],
});

console.log(`Cache creation tokens: ${response.usage.cache_creation_input_tokens}`);
console.log(`Cache read tokens: ${response.usage.cache_read_input_tokens}`);
```

The `cache_control` field marks a cache breakpoint: everything up to and including that block is cached as a single prefix, so the plain system text above it is covered too. The first request pays for cache-creation tokens at a 25% premium over the base input price. Subsequent requests use read tokens at a 90% discount.

Python Implementation with OpenAI

OpenAI takes a different approach: caching is automatic for prompt prefixes over 1,024 tokens, with no `cache_control` field to set. Put static content first and dynamic content last, then read cache hits from the usage object:

```python
from openai import OpenAI

client = OpenAI(api_key="your-key")

# OpenAI caches matching prompt prefixes automatically, so ordering matters:
# static system prompt first, dynamic user query last.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},  # static, cacheable prefix
        {"role": "user", "content": user_query},       # dynamic, full price
    ],
)

print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
```

Cache Invalidation Strategy

Caches expire after a short idle window (5 minutes on Anthropic's default TTL; each hit resets the timer). For longer persistence:

  • Regenerate on a schedule: Refresh caches whenever knowledge bases update
  • Version your context: Include a hash of your knowledge base in the cached prefix
  • Monitor hit rates: Track `cache_read_input_tokens` against total input tokens
```bash
# Example: inspect cache usage on a live request (request body in request.json)
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d @request.json \
  | jq '.usage | {cache_creation_input_tokens, cache_read_input_tokens}'
```
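The versioning idea takes only a few lines. The label format here is an assumption; the point is that any edit to the knowledge base changes the prefix, which forces a clean cache miss instead of a stale hit:

```python
import hashlib

def context_version(knowledge_base: str) -> str:
    """Short content hash: changes whenever the knowledge base changes."""
    return hashlib.sha256(knowledge_base.encode()).hexdigest()[:12]

# Embed the version in the cached prefix. An updated doc produces a new
# hash, hence a new prefix, hence a fresh cache entry.
docs = "...product documentation..."
versioned_prefix = f"[KB v{context_version(docs)}]\n{docs}"
```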

Real-World Numbers

A typical SaaS handling 10,000 daily support requests with 5,000 tokens of static context processes 50M static input tokens per day, or 1.5B per month. At $3 per million input tokens (Claude 3.5 Sonnet class pricing):

  • Monthly spend without caching: ~$4,500
  • Monthly spend with caching: ~$1,800 (60% reduction)
  • Savings: ~$2,700/month, over $32,000/year

Scaling to 100,000 requests daily multiplies the impact proportionally.

The Takeaway

Prompt caching isn't optional anymore—it's the baseline for cost-conscious LLM deployments. The implementation is straightforward, the savings are measurable, and the technical risk is minimal. If you're processing requests against static context, you're leaving money on the table.


LavaPi Team

Digital Engineering Company
