Cloud platforms are convenient, but sometimes you need more control. Maybe it's cost at scale, data sovereignty requirements, or simply a desire to avoid vendor lock-in.
Good news: you can build a production-grade LLMOps platform using open-source tools. I've done it multiple times, and I'll show you exactly how.
Series Finale: Part 7: MLOps/LLMOps Fundamentals → Part 8: Cloud Platforms → Part 9: DIY Implementation (You are here)
DIY LLMOps Architecture
The stack below combines GitHub Actions for CI/CD, Kubernetes for serving and autoscaling, vLLM for self-hosted model inference, and Langfuse for tracing, all of it open source and all of it under your control.
GitHub Actions for LLMOps
Let’s build a complete CI/CD pipeline for LLM applications using GitHub Actions.
Pipeline Architecture
The pipeline runs in six stages: code quality checks, prompt validation, LLM evaluation with quality gates, container build and signing, staging deployment with smoke and load tests, and a gated production deployment with a canary step.
Complete GitHub Actions Workflow
# .github/workflows/llmops-pipeline.yml
name: LLMOps Pipeline
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
jobs:
  # Stage 1: Code Quality
  code-quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - name: Install dependencies
        run: |
          pip install ruff mypy pytest pytest-cov
          pip install -r requirements.txt
      - name: Lint with Ruff
        run: ruff check .
      - name: Type check with mypy
        run: mypy src/
      - name: Run unit tests
        run: pytest tests/unit -v --cov=src --cov-report=xml
      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          files: coverage.xml
  # Stage 2: Prompt Validation
  prompt-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Full history for diff
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install pydantic pyyaml
      - name: Validate prompt schemas
        run: python scripts/validate_prompts.py
      - name: Check prompt versions
        run: python scripts/check_prompt_versions.py
      - name: Generate prompt diff report
        if: github.event_name == 'pull_request'
        run: |
          python scripts/prompt_diff.py \
            --base ${{ github.event.pull_request.base.sha }} \
            --head ${{ github.sha }} \
            > prompt_diff.md
      - name: Post PR comment with diff
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const diff = fs.readFileSync('prompt_diff.md', 'utf8');
            if (diff.trim()) {
              github.rest.issues.createComment({
                issue_number: context.issue.number,
                owner: context.repo.owner,
                repo: context.repo.repo,
                body: '## Prompt Changes\n\n' + diff
              });
            }
  # Stage 3: LLM Evaluation
  llm-evaluation:
    runs-on: ubuntu-latest
    needs: [code-quality, prompt-validation]
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-eval.txt
      - name: Run evaluation suite
        run: |
          python -m evaluation.run \
            --config eval/config.yaml \
            --output eval_results/
      - name: Check quality gates
        run: |
          python -m evaluation.gates \
            --results eval_results/results.json \
            --min-relevance 0.85 \
            --min-faithfulness 0.90 \
            --max-latency-p95 3000
      - name: Upload evaluation results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval_results/
      - name: Post evaluation summary
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval_results/results.json', 'utf8'));
            const summary = `## LLM Evaluation Results
            | Metric | Score | Threshold | Status |
            |--------|-------|-----------|--------|
            | Relevance | ${results.relevance.toFixed(3)} | 0.85 | ${results.relevance >= 0.85 ? '✅' : '❌'} |
            | Faithfulness | ${results.faithfulness.toFixed(3)} | 0.90 | ${results.faithfulness >= 0.90 ? '✅' : '❌'} |
            | Latency P95 | ${results.latency_p95}ms | 3000ms | ${results.latency_p95 <= 3000 ? '✅' : '❌'} |
            `;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: summary
            });
  # Stage 4: Build
  build:
    runs-on: ubuntu-latest
    needs: llm-evaluation
    permissions:
      contents: read
      packages: write
      id-token: write # required for keyless cosign signing
    outputs:
      # metadata-action emits one tag per line; pass a single, fully qualified tag downstream
      image-tag: ${{ fromJSON(steps.meta.outputs.json).tags[0] }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=ref,event=branch
            type=semver,pattern={{version}}
      - name: Build and push
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
      - name: Install cosign
        uses: sigstore/cosign-installer@v3
      - name: Sign image
        run: |
          cosign sign --yes ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build.outputs.digest }}
  # Stage 5: Deploy to Staging
  deploy-staging:
    runs-on: ubuntu-latest
    needs: build
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Set up kubectl
        uses: azure/setup-kubectl@v3
      - name: Configure kubeconfig
        run: |
          echo "${{ secrets.KUBE_CONFIG_STAGING }}" | base64 -d > kubeconfig
          # Persist KUBECONFIG for later steps (a plain export would not survive the step boundary)
          echo "KUBECONFIG=$PWD/kubeconfig" >> "$GITHUB_ENV"
      - name: Deploy to staging
        run: |
          kubectl set image deployment/llm-app \
            llm-app=${{ needs.build.outputs.image-tag }} \
            -n staging
          kubectl rollout status deployment/llm-app -n staging
      - name: Run smoke tests
        run: |
          python tests/smoke/run.py --env staging
      - name: Run load test
        run: |
          # Assumes the k6 binary is available on the runner
          k6 run tests/load/staging.js
  # Stage 6: Deploy to Production (manual approval via the environment's protection rules)
  deploy-production:
    runs-on: ubuntu-latest
    needs: [build, deploy-staging]
    if: github.ref == 'refs/heads/main'
    environment:
      name: production
      url: https://api.example.com
    steps:
      - uses: actions/checkout@v4
      - name: Set up kubectl
        uses: azure/setup-kubectl@v3
      - name: Configure kubeconfig
        run: |
          echo "${{ secrets.KUBE_CONFIG_PROD }}" | base64 -d > kubeconfig
          echo "KUBECONFIG=$PWD/kubeconfig" >> "$GITHUB_ENV"
      - name: Canary deployment (10%)
        run: |
          kubectl apply -f k8s/canary.yaml
          sleep 60
      - name: Check canary metrics
        run: |
          python scripts/check_canary.py --threshold 0.99
      - name: Full rollout
        run: |
          kubectl set image deployment/llm-app \
            llm-app=${{ needs.build.outputs.image-tag }} \
            -n production
          kubectl rollout status deployment/llm-app -n production
      - name: Notify team
        uses: slackapi/slack-github-action@v1
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
        with:
          payload: |
            {
              "text": "🚀 LLM App deployed to production",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Deployment Complete*\nImage: `${{ needs.build.outputs.image-tag }}`\nCommit: `${{ github.sha }}`"
                  }
                }
              ]
            }
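The evaluation job shells out to `python -m evaluation.gates`, which I haven't shown. Here is a minimal sketch of what that module can look like; the CLI flags mirror the workflow step above, and the flat results.json layout (relevance, faithfulness, latency_p95) matches what the PR-comment step reads, but everything else is an assumption about your evaluation harness.

# evaluation/gates.py (minimal sketch)
import argparse
import json
import sys

def main() -> int:
    parser = argparse.ArgumentParser(
        description="Fail the pipeline if evaluation metrics miss their thresholds"
    )
    parser.add_argument("--results", required=True, help="Path to results.json from evaluation.run")
    parser.add_argument("--min-relevance", type=float, required=True)
    parser.add_argument("--min-faithfulness", type=float, required=True)
    parser.add_argument("--max-latency-p95", type=float, required=True, help="Milliseconds")
    args = parser.parse_args()

    with open(args.results) as f:
        results = json.load(f)

    # (metric, observed value, passed?)
    checks = [
        ("relevance", results["relevance"], results["relevance"] >= args.min_relevance),
        ("faithfulness", results["faithfulness"], results["faithfulness"] >= args.min_faithfulness),
        ("latency_p95", results["latency_p95"], results["latency_p95"] <= args.max_latency_p95),
    ]
    for name, value, ok in checks:
        print(f"{'PASS' if ok else 'FAIL'} {name}={value}")

    # A non-zero exit code fails the GitHub Actions step and blocks the deploy stages
    return 0 if all(ok for _, _, ok in checks) else 1

if __name__ == "__main__":
    sys.exit(main())

The important part is the exit code: because the build and deploy jobs depend on this one, a failed gate stops the image from ever being built or shipped.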
Kubernetes Deployment Manifests
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-app
  labels:
    app: llm-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-app
  template:
    metadata:
      labels:
        app: llm-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
    spec:
      containers:
        - name: llm-app
          # Placeholder tag; CI pins the image via kubectl set image during deployment
          image: ghcr.io/myorg/llm-app:latest
          ports:
            - containerPort: 8000
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: openai-api-key
            - name: REDIS_URL
              value: "redis://redis:6379"
            - name: QDRANT_URL
              value: "http://qdrant:6333"
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
---
# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-app
spec:
  selector:
    app: llm-app
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
---
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Custom pods metric; requires Prometheus plus a metrics adapter (see note below)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
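One caveat on the HPA above: the http_requests_per_second pods metric doesn't exist out of the box. The deployment's prometheus.io annotations only get the pod scraped; you still need Prometheus plus an adapter such as prometheus-adapter to surface the rate through the custom metrics API. On the application side, exposing the underlying counter is a few lines. A rough sketch with prometheus_client and FastAPI (the metric and label names here are my assumptions, not something the manifests require):

# metrics.py (sketch: expose a request counter on the same port the annotations point at)
from fastapi import FastAPI, Request
from prometheus_client import Counter, make_asgi_app

app = FastAPI()

# Cumulative request count; the adapter derives the per-second rate the HPA consumes
HTTP_REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "method"])

@app.middleware("http")
async def count_requests(request: Request, call_next):
    HTTP_REQUESTS.labels(path=request.url.path, method=request.method).inc()
    return await call_next(request)

# Serve Prometheus metrics at /metrics on port 8000
app.mount("/metrics", make_asgi_app())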
Self-Hosted Model Serving with vLLM
# k8s/vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            # Note: gated Llama checkpoints also require a Hugging Face token (e.g. HF_TOKEN from a secret)
            - "--model"
            - "meta-llama/Llama-3.3-70B-Instruct"
            - "--tensor-parallel-size"
            - "2"
            - "--max-model-len"
            - "32768"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 2
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
      nodeSelector:
        gpu-type: a100
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
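Because the container runs vLLM's OpenAI-compatible server, the application doesn't need a special client: point the standard OpenAI SDK at the in-cluster service. A minimal sketch, assuming a ClusterIP Service named vllm-llama in front of the deployment (the Service manifest isn't shown here):

# vllm_client.py (sketch; service name and prompt are assumptions)
from openai import OpenAI

# Talk to the self-hosted vLLM endpoint instead of api.openai.com
client = OpenAI(
    base_url="http://vllm-llama:8000/v1",
    api_key="not-needed",  # vLLM ignores the key unless the server is started with --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # must match the --model argument above
    messages=[{"role": "user", "content": "Summarize this document in three bullet points."}],
    max_tokens=256,
)
print(response.choices[0].message.content)

This is what makes self-hosting relatively painless: switching between a managed API and your own GPUs is mostly a base_url change.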
Observability with Langfuse
# observability.py
from fastapi import FastAPI, Request
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
from openai import OpenAI

# Initialize Langfuse (keys are placeholders; prefer loading them from environment variables)
langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://langfuse.your-domain.com"  # Self-hosted
)

@observe(as_type="generation")
def call_llm(prompt: str, model: str = "gpt-4o") -> str:
    """Traced LLM call."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    # Attach model and token usage to the current observation
    langfuse_context.update_current_observation(
        model=model,
        usage={
            "input": response.usage.prompt_tokens,
            "output": response.usage.completion_tokens
        }
    )
    return response.choices[0].message.content

@observe(name="retrieval")
def traced_retrieval(query: str) -> list[dict]:
    """Retrieval step, traced as its own span via the decorator."""
    docs = retrieve_documents(query)  # your retrieval function (a sketch follows below)
    langfuse_context.update_current_observation(metadata={"num_docs": len(docs)})
    return docs

@observe()
def rag_pipeline(query: str) -> dict:
    """Traced RAG pipeline: retrieval and generation appear as nested observations."""
    docs = traced_retrieval(query)
    context = "\n".join([d["content"] for d in docs])
    answer = call_llm(f"Context: {context}\n\nQuestion: {query}")
    # Score the trace (set the value from actual user feedback in production)
    langfuse_context.score_current_trace(
        name="user_feedback",
        value=1,
        comment="User marked as helpful"
    )
    return {"answer": answer, "sources": docs}

# Usage in FastAPI
app = FastAPI()

@app.post("/query")
@observe()
async def query_endpoint(request: Request):
    body = await request.json()
    # Set user and session context on the current trace
    langfuse_context.update_current_trace(
        user_id=request.headers.get("X-User-ID"),
        session_id=request.headers.get("X-Session-ID"),
        metadata={"endpoint": "/query"}
    )
    result = rag_pipeline(body["query"])
    return result
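The pipeline above leans on a retrieve_documents helper I didn't show. Here's a minimal sketch against the Qdrant instance wired up in the deployment manifest (QDRANT_URL); the collection name, payload field, and embedding model are assumptions to keep the example self-contained:

# retrieval.py (sketch; collection name "docs" and payload field "content" are assumptions)
import os
from openai import OpenAI
from qdrant_client import QdrantClient

qdrant = QdrantClient(url=os.environ.get("QDRANT_URL", "http://qdrant:6333"))
openai_client = OpenAI()

def retrieve_documents(query: str, top_k: int = 5) -> list[dict]:
    """Embed the query and return the top-k matching documents from Qdrant."""
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    ).data[0].embedding

    hits = qdrant.search(
        collection_name="docs",
        query_vector=embedding,
        limit=top_k,
    )
    return [{"content": hit.payload["content"], "score": hit.score} for hit in hits]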
Cost Comparison: Cloud vs Self-Hosted
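There's no single answer here; it comes down to your volume, model choice, and GPU pricing. The honest way to compare is to plug in your own numbers. A small helper for that back-of-envelope math (every figure is an input; nothing below is a claimed price):

# cost_compare.py (sketch: all prices and volumes come from the command line)
import argparse

def main() -> None:
    p = argparse.ArgumentParser(description="Compare managed-API vs self-hosted inference cost per month")
    p.add_argument("--tokens-per-month", type=float, required=True, help="Total input+output tokens per month")
    p.add_argument("--api-price-per-1m-tokens", type=float, required=True, help="Blended API price per 1M tokens")
    p.add_argument("--gpu-hourly-cost", type=float, required=True, help="Cost of one GPU node per hour")
    p.add_argument("--gpu-count", type=int, required=True, help="GPU nodes running around the clock")
    p.add_argument("--ops-overhead", type=float, default=0.0, help="Monthly engineering/ops overhead")
    args = p.parse_args()

    api_cost = args.tokens_per_month / 1_000_000 * args.api_price_per_1m_tokens
    self_hosted_cost = args.gpu_hourly_cost * args.gpu_count * 24 * 30 + args.ops_overhead  # ~30-day month

    print(f"Managed API: {api_cost:,.0f}/month")
    print(f"Self-hosted: {self_hosted_cost:,.0f}/month")
    if api_cost > 0:
        print(f"Self-hosting saves {(api_cost - self_hosted_cost) / api_cost:.0%}")

if __name__ == "__main__":
    main()

The takeaway below quotes 70-80% savings at scale; that only materializes when the GPUs stay busy. At low or spiky volume, the managed APIs usually win, which is why I recommend starting there.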
Key Takeaways
- GitHub Actions: Full LLMOps pipeline with prompt validation, evaluation gates, and staged deployment
- Kubernetes: Production-ready manifests with autoscaling, health checks, and GPU support
- Self-hosted models: vLLM + Kubernetes for 70-80% cost savings at scale
- Observability: Langfuse (self-hosted) for complete LLM tracing
- GitOps: ArgoCD/Flux for declarative, auditable deployments
Series Conclusion
Over these 9 parts, we’ve covered the complete GenAI stack—from fundamentals to enterprise deployment. The technology is moving fast, but the patterns we’ve discussed are foundational.
Build iteratively. Start with managed services, then optimize with self-hosted as you scale. Invest in observability from day one. And always remember: the goal isn’t to use AI—it’s to solve problems.
Good luck building the future.
References & Further Reading
- vLLM – vllm.ai – High-throughput LLM serving
- Langfuse – langfuse.com – Open source LLM observability
- Qdrant – qdrant.tech – Vector database
- MLflow – mlflow.org – ML lifecycle management
- ArgoCD – argo-cd.readthedocs.io – GitOps for Kubernetes
- NVIDIA Triton – nvidia.com – Inference server
Thanks for following this series! Share your LLMOps setup on GitHub or connect on LinkedIn.