Cache LLM responses to avoid redundant API calls and reduce costs. The snippet below uses exact-match caching (a hash of the model and prompt); extending it toward semantic, similarity-based matching is sketched further down.
Code Snippet
import hashlib
from typing import Optional


class LLMCache:
    """In-memory cache keyed by a hash of the model and the exact prompt."""

    def __init__(self):
        self._cache = {}

    def _hash_prompt(self, prompt: str, model: str) -> str:
        """Create a deterministic hash for the cache key."""
        content = f"{model}:{prompt}"
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, prompt: str, model: str) -> Optional[str]:
        key = self._hash_prompt(prompt, model)
        return self._cache.get(key)

    def set(self, prompt: str, model: str, response: str) -> None:
        key = self._hash_prompt(prompt, model)
        self._cache[key] = response


cache = LLMCache()


def cached_llm_call(prompt: str, model: str = "gpt-4") -> str:
    # Check the cache first
    cached = cache.get(prompt, model)
    if cached is not None:  # don't treat an empty cached response as a miss
        return cached

    # Make the actual API call (call_openai_api stands in for your provider's SDK call)
    response = call_openai_api(prompt, model)

    # Cache the response for future use
    cache.set(prompt, model, response)
    return response
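The snippet assumes a call_openai_api helper that is not defined in the post. A minimal sketch of what it could look like, assuming the official openai Python SDK (v1+) with OPENAI_API_KEY set in the environment:

from openai import OpenAI

_client = OpenAI()  # reads OPENAI_API_KEY from the environment


def call_openai_api(prompt: str, model: str) -> str:
    # Single-turn chat completion; adapt the messages and parameters to your use case
    response = _client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

With that in place, calling cached_llm_call with the same prompt twice sends only one request over the network.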
Why This Helps
- Reduces API costs for repeated queries; the savings scale with your cache hit rate
- Near-instant responses for cached prompts (no network round trip)
- Enables offline development and testing once responses have been cached
How to Test
- Call the same prompt twice and verify the second call is served from the cache (see the test sketch below)
- Monitor API call counts before and after enabling the cache
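A minimal test sketch, assuming the code above lives in a hypothetical module named llm_cache and using unittest.mock to stub out the API call:

from unittest.mock import patch

import llm_cache  # hypothetical module containing cached_llm_call and call_openai_api


def test_second_call_is_a_cache_hit():
    with patch.object(llm_cache, "call_openai_api", return_value="stubbed answer") as mock_api:
        first = llm_cache.cached_llm_call("What is semantic caching?")
        second = llm_cache.cached_llm_call("What is semantic caching?")

    assert first == second == "stubbed answer"
    assert mock_api.call_count == 1  # only the first call reached the (stubbed) API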
When to Use
Any application with repeated or similar prompts: chatbots, content generation, analysis pipelines. Keep in mind that the hash-based cache above only matches identical prompts; catching merely similar prompts requires an embedding-based lookup, sketched below.
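A hedged sketch of that semantic lookup, assuming an embed_fn you supply (for example, a call to an embeddings endpoint) and cosine similarity over stored prompt vectors; the 0.95 threshold is an assumption to tune for your domain:

import math

SIMILARITY_THRESHOLD = 0.95  # assumption: tune per domain and embedding model


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticLLMCache:
    """Stores (embedding, response) pairs and serves the closest match above a threshold."""

    def __init__(self, embed_fn):
        self._embed = embed_fn   # hypothetical: any function mapping text -> list[float]
        self._entries = []       # list of (embedding, response) tuples

    def get(self, prompt: str):
        query = self._embed(prompt)
        best_response, best_score = None, 0.0
        for embedding, response in self._entries:
            score = _cosine(query, embedding)
            if score > best_score:
                best_response, best_score = response, score
        return best_response if best_score >= SIMILARITY_THRESHOLD else None

    def set(self, prompt: str, response: str):
        self._entries.append((self._embed(prompt), response))

The linear scan is fine for a few thousand entries; beyond that, a vector store can handle the nearest-neighbour search for you.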
Performance/Security Notes
The in-memory dict above is lost on restart and not shared between processes; use Redis (or another shared store) for production caching, and set a TTL so time-sensitive content expires (see the sketch below). Be careful about caching responses that contain user-specific or sensitive data unless the cache is scoped per user.
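A minimal sketch of the same cache backed by Redis with a TTL, assuming the redis-py client and a Redis instance on localhost; the host, port, and one-hour default are assumptions:

import hashlib
from typing import Optional

import redis  # assumption: redis-py is installed and a Redis server is reachable


class RedisLLMCache:
    def __init__(self, ttl_seconds: int = 3600):
        self._client = redis.Redis(host="localhost", port=6379, decode_responses=True)
        self._ttl = ttl_seconds

    def _key(self, prompt: str, model: str) -> str:
        return "llm:" + hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

    def get(self, prompt: str, model: str) -> Optional[str]:
        return self._client.get(self._key(prompt, model))

    def set(self, prompt: str, model: str, response: str) -> None:
        # SETEX stores the value with an expiry, so stale answers age out automatically
        self._client.setex(self._key(prompt, model), self._ttl, response)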
Try this tip in your next project and share your results in the comments!