You’ve built something cool. It works in demos. Stakeholders are excited. Now comes the hard part: making it production-ready.
I’ve helped multiple enterprises deploy GenAI at scale. The gap between “it works on my laptop” and “it handles 10,000 requests reliably” is significant. Let’s close that gap.
Series Finale: Part 1: GenAI Intro → Part 2: LLMs → Part 3: Frameworks → Part 4: Agentic AI → Part 5: Building Agents → Part 6: Enterprise (You are here)
The Enterprise GenAI Stack
┌─────────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ (Your Apps, APIs, Agents, Chatbots, Workflows) │
├─────────────────────────────────────────────────────────────────┤
│ ORCHESTRATION LAYER │
│ (LangChain, LlamaIndex, Custom Orchestration) │
├─────────────────────────────────────────────────────────────────┤
│ MODEL LAYER │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ OpenAI API │ │Azure OpenAI │ │Self-Hosted │ │
│ │ (GPT-4o) │ │ (GPT-4o) │ │(Llama 4) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ DATA LAYER │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │Vector Store │ │ Doc Store │ │ Cache Layer │ │
│ │(Pinecone) │ │(S3/Blob) │ │(Redis) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ PLATFORM LAYER │
│ (Kubernetes, Monitoring, Security, CI/CD) │
└─────────────────────────────────────────────────────────────────┘
Deployment Patterns
Pattern 1: API Gateway + Model Router
The most common pattern—route requests to appropriate models based on task type, cost, and availability.
# model_router.py
from litellm import completion
import time
from dataclasses import dataclass
from enum import Enum
class TaskComplexity(Enum):
SIMPLE = "simple" # Classification, extraction
MEDIUM = "medium" # Summarization, standard generation
COMPLEX = "complex" # Reasoning, code, analysis
CREATIVE = "creative" # Brainstorming, writing
@dataclass
class ModelConfig:
model: str
max_tokens: int
cost_per_1k_input: float
cost_per_1k_output: float
MODEL_CONFIGS = {
TaskComplexity.SIMPLE: ModelConfig("gpt-4o-mini", 1000, 0.00015, 0.0006),
TaskComplexity.MEDIUM: ModelConfig("gpt-4o", 2000, 0.005, 0.015),
TaskComplexity.COMPLEX: ModelConfig("claude-4-sonnet", 4000, 0.003, 0.015),
TaskComplexity.CREATIVE: ModelConfig("gpt-4o", 4000, 0.005, 0.015),
}
class ModelRouter:
def __init__(self):
self.request_counts = {}
self.total_cost = 0
def classify_task(self, prompt: str) -> TaskComplexity:
"""Classify task complexity based on prompt characteristics."""
prompt_lower = prompt.lower()
# Simple heuristics - use a classifier model in production
if any(w in prompt_lower for w in ["classify", "extract", "yes or no", "true or false"]):
return TaskComplexity.SIMPLE
elif any(w in prompt_lower for w in ["summarize", "explain", "describe"]):
return TaskComplexity.MEDIUM
elif any(w in prompt_lower for w in ["analyze", "debug", "implement", "design", "review"]):
return TaskComplexity.COMPLEX
elif any(w in prompt_lower for w in ["brainstorm", "creative", "story", "imagine"]):
return TaskComplexity.CREATIVE
else:
return TaskComplexity.MEDIUM
def route(self, prompt: str, messages: list,
override_complexity: TaskComplexity = None) -> dict:
"""Route request to appropriate model."""
complexity = override_complexity or self.classify_task(prompt)
config = MODEL_CONFIGS[complexity]
start_time = time.time()
response = completion(
model=config.model,
messages=messages,
max_tokens=config.max_tokens
)
latency = time.time() - start_time
# Track costs
input_tokens = response.usage.prompt_tokens
output_tokens = response.usage.completion_tokens
cost = (input_tokens / 1000 * config.cost_per_1k_input +
output_tokens / 1000 * config.cost_per_1k_output)
self.total_cost += cost
return {
"content": response.choices[0].message.content,
"model": config.model,
"complexity": complexity.value,
"latency_ms": int(latency * 1000),
"cost": round(cost, 6),
"tokens": {"input": input_tokens, "output": output_tokens}
}
Pattern 2: Fallback Chain
# fallback_chain.py
from litellm import completion
from tenacity import retry, stop_after_attempt, wait_exponential
class FallbackChain:
"""Try models in order until one succeeds."""
def __init__(self):
self.model_chain = [
"gpt-4o", # Primary
"claude-4-sonnet", # First fallback
"gemini-2.5-pro", # Second fallback
"gpt-4o-mini", # Last resort (cheaper, faster)
]
def complete(self, messages: list, **kwargs) -> dict:
"""Try each model in chain until success."""
errors = []
for model in self.model_chain:
try:
response = completion(
model=model,
messages=messages,
timeout=30,
**kwargs
)
return {
"content": response.choices[0].message.content,
"model_used": model,
"fallback_count": len(errors)
}
except Exception as e:
errors.append({"model": model, "error": str(e)})
continue
raise Exception(f"All models failed: {errors}")
Observability: Seeing What’s Actually Happening
GenAI systems are non-deterministic. Without proper observability, debugging is nearly impossible.
Essential Metrics
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Latency (p50, p95, p99) | User experience | p95 > 5s |
| Token usage per request | Cost control | > 2x baseline |
| Error rate by model | Reliability | > 1% |
| Hallucination rate | Quality | Task-dependent |
| Cost per request | Budget | > budget / expected_requests |
| Cache hit rate | Efficiency | < 30% |
# observability.py
import time
import json
from datetime import datetime
from dataclasses import dataclass, asdict
import structlog
logger = structlog.get_logger()
@dataclass
class LLMTrace:
trace_id: str
timestamp: str
model: str
prompt_tokens: int
completion_tokens: int
latency_ms: int
cost_usd: float
status: str # success, error, timeout
error_message: str = None
cache_hit: bool = False
user_id: str = None
request_type: str = None
class LLMObserver:
"""Observability wrapper for LLM calls."""
def __init__(self, metrics_client=None):
self.metrics = metrics_client # DataDog, Prometheus, etc.
def observe(self, func):
"""Decorator to observe LLM calls."""
def wrapper(*args, **kwargs):
trace_id = self._generate_trace_id()
start_time = time.time()
try:
result = func(*args, **kwargs)
latency = (time.time() - start_time) * 1000
trace = LLMTrace(
trace_id=trace_id,
timestamp=datetime.utcnow().isoformat(),
model=result.get("model", "unknown"),
prompt_tokens=result.get("usage", {}).get("prompt_tokens", 0),
completion_tokens=result.get("usage", {}).get("completion_tokens", 0),
latency_ms=int(latency),
cost_usd=self._calculate_cost(result),
status="success",
cache_hit=result.get("cache_hit", False)
)
self._emit(trace)
return result
except Exception as e:
latency = (time.time() - start_time) * 1000
trace = LLMTrace(
trace_id=trace_id,
timestamp=datetime.utcnow().isoformat(),
model=kwargs.get("model", "unknown"),
prompt_tokens=0,
completion_tokens=0,
latency_ms=int(latency),
cost_usd=0,
status="error",
error_message=str(e)
)
self._emit(trace)
raise
return wrapper
def _emit(self, trace: LLMTrace):
"""Emit trace to logging and metrics."""
logger.info("llm_call", **asdict(trace))
if self.metrics:
self.metrics.histogram("llm.latency", trace.latency_ms,
tags=[f"model:{trace.model}"])
self.metrics.increment("llm.requests",
tags=[f"model:{trace.model}", f"status:{trace.status}"])
self.metrics.gauge("llm.cost", trace.cost_usd,
tags=[f"model:{trace.model}"])
Security: Protecting Your GenAI Systems
Prompt Injection Defense
# security.py
import re
from typing import Tuple
class PromptSecurityFilter:
"""Defense against prompt injection attacks."""
INJECTION_PATTERNS = [
r"ignore (previous|all|above) instructions",
r"disregard (your|the) (rules|instructions|guidelines)",
r"you are now",
r"new instructions:",
r"system prompt:",
r"<system>",
r"</system>",
r"\[INST\]",
r"\[/INST\]",
]
def __init__(self):
self.patterns = [re.compile(p, re.IGNORECASE) for p in self.INJECTION_PATTERNS]
def check_input(self, user_input: str) -> Tuple[bool, str]:
"""
Check user input for injection attempts.
Returns (is_safe, reason).
"""
for pattern in self.patterns:
if pattern.search(user_input):
return False, f"Potential injection detected: {pattern.pattern}"
return True, "Input appears safe"
def sanitize_for_prompt(self, user_input: str) -> str:
"""Sanitize user input before including in prompt."""
# Escape special characters
sanitized = user_input.replace("{", "{{").replace("}", "}}")
# Add clear delimiters
return f"\n{sanitized}\n "
# Usage in your application
security = PromptSecurityFilter()
def process_user_query(user_input: str):
is_safe, reason = security.check_input(user_input)
if not is_safe:
logger.warning("Blocked input", reason=reason, input=user_input[:100])
return {"error": "Invalid input"}
sanitized = security.sanitize_for_prompt(user_input)
# Now safe to use in prompt
prompt = f"""
You are a helpful assistant. Answer the user's question.
{sanitized}
"""
return call_llm(prompt)
Data Privacy Patterns
# pii_handling.py
import re
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
class PIIHandler:
"""Handle PII in prompts and responses."""
def __init__(self):
self.analyzer = AnalyzerEngine()
self.anonymizer = AnonymizerEngine()
self.pii_map = {} # For de-anonymization if needed
def anonymize(self, text: str) -> str:
"""Replace PII with placeholders."""
# Detect PII
results = self.analyzer.analyze(
text=text,
language="en",
entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER",
"CREDIT_CARD", "US_SSN", "IP_ADDRESS"]
)
# Anonymize
anonymized = self.anonymizer.anonymize(
text=text,
analyzer_results=results
)
return anonymized.text
def process_for_llm(self, user_input: str) -> Tuple[str, dict]:
"""
Anonymize input before sending to LLM.
Returns anonymized text and mapping for restoration.
"""
# Custom pattern for specific formats
patterns = {
"email": r'\b[\w.-]+@[\w.-]+\.\w+\b',
"phone": r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
"ssn": r'\b\d{3}-\d{2}-\d{4}\b',
}
mapping = {}
anonymized = user_input
for pii_type, pattern in patterns.items():
matches = re.findall(pattern, anonymized)
for i, match in enumerate(matches):
placeholder = f"[{pii_type.upper()}_{i}]"
mapping[placeholder] = match
anonymized = anonymized.replace(match, placeholder, 1)
return anonymized, mapping
Cost Management
# cost_management.py
from datetime import datetime, timedelta
from collections import defaultdict
import threading
class CostManager:
"""Track and limit LLM spending."""
def __init__(self, daily_budget_usd: float = 100.0):
self.daily_budget = daily_budget_usd
self.spending = defaultdict(float) # date -> amount
self.lock = threading.Lock()
def _today(self) -> str:
return datetime.utcnow().strftime("%Y-%m-%d")
def record_cost(self, cost_usd: float, model: str = None):
"""Record a cost."""
with self.lock:
self.spending[self._today()] += cost_usd
def get_remaining_budget(self) -> float:
"""Get remaining budget for today."""
return max(0, self.daily_budget - self.spending[self._today()])
def can_afford(self, estimated_cost: float) -> bool:
"""Check if we can afford a request."""
return self.get_remaining_budget() >= estimated_cost
def estimate_cost(self, prompt_tokens: int, max_completion_tokens: int,
model: str) -> float:
"""Estimate cost before making request."""
pricing = {
"gpt-4o": (0.005, 0.015),
"gpt-4o-mini": (0.00015, 0.0006),
"claude-4-sonnet": (0.003, 0.015),
"claude-4-opus": (0.015, 0.075),
"gemini-2.5-pro": (0.00125, 0.005),
}
input_rate, output_rate = pricing.get(model, (0.01, 0.03))
return (prompt_tokens / 1000 * input_rate +
max_completion_tokens / 1000 * output_rate)
# Usage middleware
cost_manager = CostManager(daily_budget_usd=500)
def cost_aware_completion(messages: list, model: str, max_tokens: int):
"""Completion with cost checks."""
# Estimate token count (rough)
prompt_text = " ".join([m["content"] for m in messages])
estimated_prompt_tokens = len(prompt_text) // 4
estimated_cost = cost_manager.estimate_cost(
estimated_prompt_tokens, max_tokens, model
)
if not cost_manager.can_afford(estimated_cost):
# Fallback to cheaper model or reject
if model != "gpt-4o-mini":
return cost_aware_completion(messages, "gpt-4o-mini", max_tokens)
else:
raise Exception("Daily budget exhausted")
response = completion(model=model, messages=messages, max_tokens=max_tokens)
# Record actual cost
actual_cost = cost_manager.estimate_cost(
response.usage.prompt_tokens,
response.usage.completion_tokens,
model
)
cost_manager.record_cost(actual_cost, model)
return response
Caching Strategies
# caching.py
import hashlib
import json
import redis
from typing import Optional
class SemanticCache:
"""Cache LLM responses with semantic similarity matching."""
def __init__(self, redis_client: redis.Redis, embeddings_model):
self.redis = redis_client
self.embeddings = embeddings_model
self.similarity_threshold = 0.95
def _hash_key(self, text: str) -> str:
"""Create deterministic hash for exact matches."""
return hashlib.sha256(text.encode()).hexdigest()[:16]
def get_exact(self, prompt: str) -> Optional[str]:
"""Check for exact match in cache."""
key = f"llm:exact:{self._hash_key(prompt)}"
cached = self.redis.get(key)
return cached.decode() if cached else None
def set_exact(self, prompt: str, response: str, ttl: int = 3600):
"""Cache an exact match."""
key = f"llm:exact:{self._hash_key(prompt)}"
self.redis.setex(key, ttl, response)
def get_semantic(self, prompt: str) -> Optional[str]:
"""Check for semantically similar cached response."""
# Get embedding for prompt
prompt_embedding = self.embeddings.embed_query(prompt)
# Search vector store for similar prompts
# (Implementation depends on your vector store)
similar = self.vector_store.similarity_search_with_score(
prompt_embedding, k=1
)
if similar and similar[0][1] >= self.similarity_threshold:
cached_key = similar[0][0].metadata["response_key"]
return self.redis.get(cached_key).decode()
return None
def cached_completion(self, messages: list, model: str, **kwargs):
"""Completion with caching."""
prompt = json.dumps(messages)
# Try exact match first (fastest)
cached = self.get_exact(prompt)
if cached:
return {"content": cached, "cache_hit": True, "cache_type": "exact"}
# Try semantic match
cached = self.get_semantic(prompt)
if cached:
return {"content": cached, "cache_hit": True, "cache_type": "semantic"}
# No cache hit - call LLM
response = completion(model=model, messages=messages, **kwargs)
content = response.choices[0].message.content
# Cache the response
self.set_exact(prompt, content)
return {"content": content, "cache_hit": False}
The Future: What’s Coming
Trends I’m Watching
- Smaller, specialized models: Fine-tuned models for specific tasks will often beat general-purpose giants
- On-device inference: Apple, Google, Qualcomm are pushing LLMs to edge devices
- Multi-modal by default: Text, images, audio, video in unified models
- Agentic workflows: More autonomous, multi-step AI systems
- Better reasoning: Models that can actually think, not just pattern match
- Regulation: EU AI Act and similar will shape enterprise adoption
Final Thoughts
We’re at an inflection point. GenAI is no longer experimental—it’s becoming infrastructure. The companies that figure out how to deploy it reliably, securely, and cost-effectively will have significant advantages.
But remember: AI is a tool, not magic. The fundamentals still matter—good architecture, clean code, proper testing, security-first design. GenAI amplifies your capabilities; it doesn’t replace engineering rigor.
Start small. Deploy something real. Learn from production. Iterate.
That’s how you build the future.
Series Recap
| Part | Focus | Key Takeaway |
|---|---|---|
| 1 | GenAI Foundations | Understand the landscape and basic concepts |
| 2 | LLMs Deep Dive | Prompting techniques and model selection |
| 3 | Frameworks | LangChain, LlamaIndex, and when to use each |
| 4 | Agentic AI | Building autonomous, tool-using systems |
| 5 | Building Agents | Practical implementation patterns |
| 6 | Enterprise | Production deployment and operations |
References & Further Reading
- Azure OpenAI Enterprise Patterns – learn.microsoft.com
- AWS Bedrock – aws.amazon.com/bedrock
- Google Vertex AI – cloud.google.com
- OWASP LLM Top 10 – LLM Security Risks
- Presidio (Microsoft PII Detection) – microsoft.github.io/presidio
- LangSmith (LangChain Observability) – smith.langchain.com
- EU AI Act – artificialintelligenceact.eu
Thanks for following this series! Connect with me on GitHub or LinkedIn. Let’s build something amazing.
Discover more from Code, Cloud & Context
Subscribe to get the latest posts sent to your email.