Tips and Tricks #191: Cache LLM Responses for Cost Reduction

Cache LLM responses to avoid redundant API calls and reduce costs. The snippet below implements a simple exact-match cache; a semantic variant that also matches similar prompts is sketched after it.

Code Snippet

import hashlib
from typing import Optional

class LLMCache:
    def __init__(self):
        self._cache = {}
    
    def _hash_prompt(self, prompt: str, model: str) -> str:
        """Create deterministic hash for cache key."""
        content = f"{model}:{prompt}"
        return hashlib.sha256(content.encode()).hexdigest()
    
    def get(self, prompt: str, model: str) -> Optional[str]:
        key = self._hash_prompt(prompt, model)
        return self._cache.get(key)
    
    def set(self, prompt: str, model: str, response: str):
        key = self._hash_prompt(prompt, model)
        self._cache[key] = response

cache = LLMCache()

def cached_llm_call(prompt: str, model: str = "gpt-4") -> str:
    # Check cache first ("is not None" so an empty cached response still counts as a hit)
    cached = cache.get(prompt, model)
    if cached is not None:
        return cached
    
    # Make the actual API call; call_openai_api stands in for your own LLM client wrapper
    response = call_openai_api(prompt, model)
    
    # Cache for future use
    cache.set(prompt, model, response)
    return response
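
The cache above only hits on byte-for-byte identical prompts. A semantic cache, as the intro mentions, also matches prompts whose embeddings are close. Here is a minimal sketch, assuming you supply an embed_fn (any function mapping a prompt to a NumPy vector); the 0.95 threshold and the linear scan are illustrative choices, not part of the original snippet:

import numpy as np
from typing import Callable, Optional

class SemanticLLMCache:
    def __init__(self, embed_fn: Callable[[str], np.ndarray], threshold: float = 0.95):
        # embed_fn is hypothetical: plug in whatever embedding model you use
        self._embed_fn = embed_fn
        self._threshold = threshold
        self._entries = []  # list of (embedding, response) pairs

    def get(self, prompt: str) -> Optional[str]:
        query = self._embed_fn(prompt)
        for emb, response in self._entries:
            # Cosine similarity between the new prompt and a cached one
            sim = np.dot(query, emb) / (np.linalg.norm(query) * np.linalg.norm(emb))
            if sim >= self._threshold:
                return response  # close enough to reuse the cached answer
        return None

    def set(self, prompt: str, response: str):
        self._entries.append((self._embed_fn(prompt), response))

For real workloads you would replace the linear scan with a vector index, but the lookup logic stays the same.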

Why This Helps

  • Can reduce API costs by 30-70% for workloads with many repeated queries
  • Faster response times for cached prompts
  • Enables offline development and testing once responses are cached (persist the cache for reuse across runs)

How to Test

  • Call the same prompt twice and verify the second call is a cache hit (see the sketch below)
  • Monitor API call counts before and after enabling the cache
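
One way to check both points at once, assuming everything lives in a single script so that the counting stub below is the call_openai_api that cached_llm_call resolves at call time:

# Stub out the API layer and count invocations to prove the second lookup
# never reaches the API.
call_count = 0

def call_openai_api(prompt: str, model: str) -> str:
    # Stand-in for the real client call
    global call_count
    call_count += 1
    return f"stub response for: {prompt}"

first = cached_llm_call("Summarize the plot of Hamlet.")
second = cached_llm_call("Summarize the plot of Hamlet.")

assert first == second
assert call_count == 1  # second call was served from the cache
print("cache hit verified, API called once")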

When to Use

Any application that sends repeated or near-identical prompts: chatbots answering common questions, content generation pipelines, and repeated analysis runs.

Performance/Security Notes

The in-memory dictionary above is per-process and lost on restart; use Redis (or another shared store) for production caching, and set a TTL so time-sensitive responses expire, as sketched below.
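
A minimal sketch of that variant with the redis-py client, assuming a Redis server on localhost:6379; the one-hour TTL and the "llmcache:" key prefix are illustrative choices:

import hashlib
from typing import Optional

import redis  # pip install redis

class RedisLLMCache:
    def __init__(self, ttl_seconds: int = 3600):
        # Assumes a Redis server reachable on localhost:6379
        self._client = redis.Redis(host="localhost", port=6379, decode_responses=True)
        self._ttl = ttl_seconds

    def _hash_prompt(self, prompt: str, model: str) -> str:
        return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

    def get(self, prompt: str, model: str) -> Optional[str]:
        return self._client.get(f"llmcache:{self._hash_prompt(prompt, model)}")

    def set(self, prompt: str, model: str, response: str):
        # SETEX stores the value with an expiry, so stale entries age out automatically
        self._client.setex(f"llmcache:{self._hash_prompt(prompt, model)}", self._ttl, response)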

Try this tip in your next project and share your results in the comments!

