Tips and Tricks #95: Cache LLM Responses for Cost Reduction

Cache LLM responses to avoid redundant API calls and cut costs. The snippet below implements exact-match caching keyed on a prompt hash; a semantic variant that also catches similarly worded prompts is sketched right after it.

Code Snippet

import hashlib
from typing import Optional

class LLMCache:
    def __init__(self):
        self._cache = {}
    
    def _hash_prompt(self, prompt: str, model: str) -> str:
        """Create deterministic hash for cache key."""
        content = f"{model}:{prompt}"
        return hashlib.sha256(content.encode()).hexdigest()
    
    def get(self, prompt: str, model: str) -> Optional[str]:
        key = self._hash_prompt(prompt, model)
        return self._cache.get(key)
    
    def set(self, prompt: str, model: str, response: str):
        key = self._hash_prompt(prompt, model)
        self._cache[key] = response

cache = LLMCache()

def cached_llm_call(prompt: str, model: str = "gpt-4") -> str:
    # Check cache first (use "is not None" so an empty cached response still counts as a hit)
    cached = cache.get(prompt, model)
    if cached is not None:
        return cached
    
    # Make the actual API call; call_openai_api is your own wrapper around the API client
    response = call_openai_api(prompt, model)
    
    # Cache for future use
    cache.set(prompt, model, response)
    return response
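
The hash-based cache above only recognizes byte-for-byte identical prompts. Below is a minimal sketch of a semantic variant, assuming you pass in an embed_fn (for example, a thin wrapper around an embeddings API) that returns a 1-D numpy vector per prompt; the linear scan and the 0.95 similarity threshold are illustrative choices, not tuned values.

import numpy as np
from typing import Callable, List, Optional, Tuple

class SemanticLLMCache:
    """Cache keyed by prompt embeddings instead of exact text."""

    def __init__(self, embed_fn: Callable[[str], np.ndarray], threshold: float = 0.95):
        # embed_fn is supplied by the caller, e.g. a wrapper around an embeddings API
        self._embed = embed_fn
        self._threshold = threshold
        self._entries: List[Tuple[np.ndarray, str]] = []  # (embedding, response) pairs

    def get(self, prompt: str) -> Optional[str]:
        query = self._embed(prompt)
        for vec, response in self._entries:
            # Cosine similarity between the new prompt and a cached one
            sim = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
            if sim >= self._threshold:
                return response
        return None

    def set(self, prompt: str, response: str) -> None:
        self._entries.append((self._embed(prompt), response))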

Why This Helps

  • Can reduce API costs by 30-70% for workloads with many repeated queries; actual savings depend on how often prompts repeat
  • Faster response times for cached prompts, since no network round trip is needed
  • Enables offline development and testing against previously cached responses

How to Test

  • Call the same prompt twice and verify the second call is a cache hit (see the stub-based sketch after this list)
  • Monitor API call counts and confirm they drop once the cache warms up
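
One quick check is to stub out call_openai_api with a counter and confirm the second identical call never reaches it. This sketch assumes the LLMCache and cached_llm_call definitions above live in the same script.

# Stub the API call so the test runs offline and counts invocations
call_count = 0

def call_openai_api(prompt: str, model: str) -> str:
    global call_count
    call_count += 1
    return f"stub response for: {prompt}"

first = cached_llm_call("Summarize this article.")
second = cached_llm_call("Summarize this article.")

assert first == second
assert call_count == 1  # the second call was served from the cache
print("cache hit verified; API called", call_count, "time")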

When to Use

Any application that sends repeated or near-identical prompts: chatbots answering common questions, content generation pipelines, or repeated analysis of the same inputs.

Performance/Security Notes

The in-memory dictionary above lives in a single process and grows without bound. For production, back the cache with a shared store such as Redis, and set a TTL so time-sensitive responses expire instead of being served stale.
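
A minimal sketch of a Redis-backed variant with a TTL, assuming the redis-py client and a Redis instance on localhost; the key hashing mirrors the LLMCache above, and the one-hour TTL is an arbitrary example.

import hashlib
from typing import Optional

import redis  # pip install redis

class RedisLLMCache:
    def __init__(self, ttl_seconds: int = 3600):
        # decode_responses=True so get() returns str instead of bytes
        self._client = redis.Redis(host="localhost", port=6379, decode_responses=True)
        self._ttl = ttl_seconds

    def _hash_prompt(self, prompt: str, model: str) -> str:
        return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

    def get(self, prompt: str, model: str) -> Optional[str]:
        return self._client.get(self._hash_prompt(prompt, model))

    def set(self, prompt: str, model: str, response: str) -> None:
        # ex= sets the TTL so stale entries expire automatically
        self._client.set(self._hash_prompt(prompt, model), response, ex=self._ttl)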

Try this tip in your next project and share your results in the comments!

