Cache LLM responses to avoid redundant API calls and reduce costs. The snippet below implements exact-match caching; a semantic cache extends the same idea by matching prompts on embedding similarity rather than exact text (a sketch of that variant follows the snippet).
Code Snippet
import hashlib
from typing import Optional

class LLMCache:
    """In-memory cache keyed by a hash of (model, prompt)."""

    def __init__(self):
        self._cache = {}

    def _hash_prompt(self, prompt: str, model: str) -> str:
        """Create a deterministic hash for the cache key."""
        content = f"{model}:{prompt}"
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, prompt: str, model: str) -> Optional[str]:
        key = self._hash_prompt(prompt, model)
        return self._cache.get(key)

    def set(self, prompt: str, model: str, response: str):
        key = self._hash_prompt(prompt, model)
        self._cache[key] = response

cache = LLMCache()

def cached_llm_call(prompt: str, model: str = "gpt-4") -> str:
    # Check the cache first
    cached = cache.get(prompt, model)
    if cached is not None:
        return cached
    # Cache miss: make the actual API call
    # (call_openai_api is your own wrapper around the provider SDK)
    response = call_openai_api(prompt, model)
    # Store the response for future calls with the same prompt and model
    cache.set(prompt, model, response)
    return response
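The class above only reuses a response when the prompt text is identical. A semantic cache instead embeds each prompt and returns a stored response when a new prompt is close enough in embedding space. Below is a minimal sketch of that idea; the embed_fn callable, the 0.95 similarity threshold, and the linear scan over stored entries are illustrative assumptions, not a production design.

import numpy as np

SIMILARITY_THRESHOLD = 0.95  # assumed cutoff; tune for your prompts

class SemanticLLMCache:
    """Reuses a response when a new prompt is semantically close to a cached one."""

    def __init__(self, embed_fn):
        # embed_fn: any callable that maps text to a fixed-size numeric vector,
        # e.g. an embeddings API or a local sentence-embedding model
        self._embed = embed_fn
        self._entries = []  # list of (embedding, response) pairs

    def get(self, prompt: str):
        query = np.asarray(self._embed(prompt), dtype=float)
        for emb, response in self._entries:
            # Cosine similarity between the new prompt and a cached prompt
            sim = np.dot(query, emb) / (np.linalg.norm(query) * np.linalg.norm(emb))
            if sim >= SIMILARITY_THRESHOLD:
                return response
        return None

    def set(self, prompt: str, response: str):
        self._entries.append((np.asarray(self._embed(prompt), dtype=float), response))

The linear scan is fine for a few thousand entries; beyond that, replace the list with a vector index (FAISS, pgvector, or similar).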
Why This Helps
- Can reduce API costs by 30-70% for workloads with many repeated queries
- Near-instant responses for cache hits instead of a full round trip to the API
- Enables offline development and testing against previously cached responses
How to Test
- Call the same prompt twice and verify the second call is served from the cache
- Count the underlying API calls to confirm only one request went out (see the sketch below)
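A minimal sketch of that check, assuming call_openai_api can be replaced by a fake in the same module (here a module-level counter stands in for real API-call monitoring):

call_count = 0

def call_openai_api(prompt: str, model: str) -> str:
    # Fake API for testing: counts invocations instead of hitting the network
    global call_count
    call_count += 1
    return f"response to: {prompt}"

def test_cache_hit():
    first = cached_llm_call("Summarize this article")
    second = cached_llm_call("Summarize this article")
    assert first == second   # same response both times
    assert call_count == 1   # the second call never reached the API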
When to Use
Any application that sends repeated or near-duplicate prompts: chatbots answering common questions, content generation pipelines, and batch analysis jobs. Note that the exact-match cache only helps with literally identical prompts; "similar" prompts require the semantic variant.
Performance/Security Notes
For production, back the cache with Redis (or a similar shared store) so entries persist across restarts and are shared between workers, and set a TTL so time-sensitive content expires; a minimal sketch follows.
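This sketch uses the redis-py client and assumes a local Redis instance and a one-hour TTL, both of which are illustrative defaults rather than recommendations:

import hashlib
import redis  # pip install redis

class RedisLLMCache:
    """Shared, persistent cache with automatic expiry via Redis TTLs."""

    def __init__(self, ttl_seconds: int = 3600, url: str = "redis://localhost:6379/0"):
        self._client = redis.Redis.from_url(url, decode_responses=True)
        self._ttl = ttl_seconds

    def _key(self, prompt: str, model: str) -> str:
        digest = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
        return f"llm-cache:{digest}"

    def get(self, prompt: str, model: str):
        return self._client.get(self._key(prompt, model))

    def set(self, prompt: str, model: str, response: str):
        # SETEX stores the value and expires it after ttl_seconds
        self._client.setex(self._key(prompt, model), self._ttl, response)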
Try this tip in your next project and share your results in the comments!