2024-04-15 5 min read

Cut LLM Costs 60% with Prompt Caching Strategies

Prompt caching eliminates redundant API charges by reusing identical context. Learn the strategies that cut production costs by 60% with minimal code changes.

Your LLM bill keeps climbing. Every API call processes the same system prompts, knowledge bases, and context windows. You're paying full price for computation you've already done. Prompt caching fixes this.

Major LLM providers now support prompt caching, including Anthropic (Claude), OpenAI (GPT-4o and later), and Google (Gemini). The mechanics are simple: cache immutable context, pay a fraction of the price on cache hits, and regenerate only when the underlying information changes. In production, this translates to roughly a 60% reduction in token costs for teams processing high-volume requests against stable data.

Here's how to implement it.

Understanding Cache Economics

When you cache 10,000 tokens of context, you pay the full input cost once (plus a small cache-write premium on most providers). Every subsequent request reads those cached tokens at roughly a 90% discount. If your system prompt plus knowledge base totals 5,000 tokens and you process 1,000 requests daily, the basic math shows the savings:

  • Without caching: 5,000 tokens × 1,000 requests = 5 million tokens/day
  • With caching: 5,000 tokens (cached once) + 500 tokens of new input × 1,000 requests ≈ 505,000 tokens/day

That's roughly a 90% reduction on static content.
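The billed-token math can be sketched in a few lines. The multipliers below are assumptions based on Anthropic's published pricing (cache writes bill at 1.25× the base input price, cache reads at 0.1×); check your provider's pricing page before relying on them:

```python
# Cache economics sketch. Assumed Anthropic-style multipliers:
# cache writes bill at 1.25x base input price, cache reads at 0.1x.
CACHE_WRITE_MULT = 1.25
CACHE_READ_MULT = 0.10

def effective_input_tokens(static, dynamic, requests):
    """Billed-token equivalents per day with the static context cached."""
    write = static * CACHE_WRITE_MULT                  # first request creates the cache
    reads = static * CACHE_READ_MULT * (requests - 1)  # every later request reads it
    fresh = dynamic * requests                         # per-request input, full price
    return write + reads + fresh

uncached = (5_000 + 500) * 1_000                   # 5,500,000 billed tokens/day
cached = effective_input_tokens(5_000, 500, 1_000)  # ~1,005,750 billed tokens/day
```

The static context alone drops by about 90%, but the dynamic input still bills at full price, which is why overall savings land well below the headline discount.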

Real-world results vary. Teams at LavaPi working with large document processing saw 60% overall savings because dynamic user queries still cost full price. The key is identifying which content is genuinely static.

Identify Cacheable Content

Not everything should be cached. Focus on:

  • System instructions (unchanged across requests)
  • Knowledge bases and reference documents
  • Few-shot examples
  • Conversation history in multi-turn workflows

Anything that changes per-user or per-request should stay outside the cache.

Example: Customer Support Bot

Your support system uses:

  • 2,000-token system prompt
  • 3,000-token product documentation
  • 1,000-token FAQ section
  • 500 dynamic tokens per customer query

Cache the first three. Pay full price only on the 500 dynamic tokens per query, and on the cached blocks again only when the documentation or FAQ is updated.
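That split can be sketched as follows, assuming Anthropic's Messages API block format (the placeholder strings stand in for your real content). A single `cache_control` breakpoint on the last static block caches the entire prefix above it:

```python
# Placeholder content; substitute your real prompt, docs, and FAQ.
SYSTEM_PROMPT = "You are a support agent for an e-commerce platform."  # ~2,000 tokens, static
PRODUCT_DOCS = "[PRODUCT_KNOWLEDGE_BASE] ..."                          # ~3,000 tokens, static
FAQ = "[FAQ] ..."                                                      # ~1,000 tokens, static

# One breakpoint on the final static block caches everything above it;
# the customer query goes in the messages array and stays uncached.
system_blocks = [
    {"type": "text", "text": SYSTEM_PROMPT},
    {"type": "text", "text": PRODUCT_DOCS},
    {"type": "text", "text": FAQ, "cache_control": {"type": "ephemeral"}},
]
```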

Implementation Patterns

Claude with Prompt Caching

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const response = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: 'You are a support agent for an e-commerce platform.',
    },
    {
      type: 'text',
      text: `[PRODUCT_KNOWLEDGE_BASE]\n${productDocs}`,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [
    {
      role: 'user',
      content: userQuery,
    },
  ],
});

console.log(`Cache creation tokens: ${response.usage.cache_creation_input_tokens}`);
console.log(`Cache read tokens: ${response.usage.cache_read_input_tokens}`);
```

The `cache_control` field marks a cache breakpoint: everything up to and including that block is cached as a single prefix, so the plain system text above it is covered too. The first request pays for cache-creation tokens at a 25% premium over the base input price. Subsequent requests use read tokens at a 90% discount.

Python Implementation with OpenAI

OpenAI takes a different approach: caching is automatic for prompt prefixes over 1,024 tokens, with no `cache_control` field to set. Put static content first and dynamic content last, then read cache hits from the usage object:

```python
from openai import OpenAI

client = OpenAI(api_key="your-key")

# OpenAI caches matching prompt prefixes automatically, so ordering matters:
# static system prompt first, dynamic user query last.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},  # static, cacheable prefix
        {"role": "user", "content": user_query},       # dynamic, full price
    ],
)

print(f"Prompt tokens: {response.usage.prompt_tokens}")
print(f"Cached tokens: {response.usage.prompt_tokens_details.cached_tokens}")
```

Cache Invalidation Strategy

Caches expire after a short idle window (5 minutes on Anthropic's default TTL; each hit resets the timer). For longer persistence:

  • Regenerate on a schedule: Refresh caches whenever knowledge bases update
  • Version your context: Include a hash of your knowledge base in the cached prefix
  • Monitor hit rates: Track `cache_read_input_tokens` against total input tokens
```bash
# Example: inspect cache usage on a live request (request body in request.json)
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d @request.json \
  | jq '.usage | {cache_creation_input_tokens, cache_read_input_tokens}'
```
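The versioning idea takes only a few lines. The label format here is an assumption; the point is that any edit to the knowledge base changes the prefix, which forces a clean cache miss instead of a stale hit:

```python
import hashlib

def context_version(knowledge_base: str) -> str:
    """Short content hash: changes whenever the knowledge base changes."""
    return hashlib.sha256(knowledge_base.encode()).hexdigest()[:12]

# Embed the version in the cached prefix. An updated doc produces a new
# hash, hence a new prefix, hence a fresh cache entry.
docs = "...product documentation..."
versioned_prefix = f"[KB v{context_version(docs)}]\n{docs}"
```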

Real-World Numbers

A typical SaaS handling 10,000 daily support requests with 5,000 tokens of static context processes 50M static input tokens per day, or 1.5B per month. At $3 per million input tokens (Claude 3.5 Sonnet class pricing):

  • Monthly spend without caching: ~$4,500
  • Monthly spend with caching: ~$1,800 (60% reduction)
  • Savings: ~$2,700/month, over $32,000/year

Scaling to 100,000 requests daily multiplies the impact proportionally.

The Takeaway

Prompt caching isn't optional anymore—it's the baseline for cost-conscious LLM deployments. The implementation is straightforward, the savings are measurable, and the technical risk is minimal. If you're processing requests against static context, you're leaving money on the table.


LavaPi Team

Digital Engineering Company
