Cloud platforms are convenient, but sometimes you need more control. Maybe it's cost at scale, data sovereignty requirements, or simply a desire to avoid vendor lock-in.
Good news: you can build a production-grade LLMOps platform using open-source tools. I've done it multiple times, and I'll show you exactly how.
Series Finale: Part 7: MLOps/LLMOps Fundamentals → Part 8: Cloud Platforms → Part 9: DIY Implementation (You are here)
DIY LLMOps Architecture
The stack below combines GitHub Actions for CI/CD, Kubernetes for serving and autoscaling, vLLM for self-hosted model inference, and Langfuse for tracing, all of it open source and all of it under your control.
GitHub Actions for LLMOps
Let’s build a complete CI/CD pipeline for LLM applications using GitHub Actions.
Pipeline Architecture
The pipeline runs in six stages: code quality checks, prompt validation, LLM evaluation with quality gates, container build and signing, staging deployment with smoke and load tests, and a gated production deployment with a canary step.
Complete GitHub Actions Workflow
# .github/workflows/llmops-pipeline.yml
name: LLMOps Pipeline
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
jobs:
  # Stage 1: Code Quality
  code-quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'
      - name: Install dependencies
        run: |
          pip install ruff mypy pytest pytest-cov
          pip install -r requirements.txt
      - name: Lint with Ruff
        run: ruff check .
      - name: Type check with mypy
        run: mypy src/
      - name: Run unit tests
        run: pytest tests/unit -v --cov=src --cov-report=xml
      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          files: coverage.xml
  # Stage 2: Prompt Validation
  prompt-validation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Full history for diff
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install pydantic pyyaml
      - name: Validate prompt schemas
        run: python scripts/validate_prompts.py
      - name: Check prompt versions
        run: python scripts/check_prompt_versions.py
      - name: Generate prompt diff report
        if: github.event_name == 'pull_request'
        run: |
          python scripts/prompt_diff.py \
            --base ${{ github.event.pull_request.base.sha }} \
            --head ${{ github.sha }} \
            > prompt_diff.md
      - name: Post PR comment with diff
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const diff = fs.readFileSync('prompt_diff.md', 'utf8');
            if (diff.trim()) {
              github.rest.issues.createComment({
                issue_number: context.issue.number,
                owner: context.repo.owner,
                repo: context.repo.repo,
                body: '## Prompt Changes\n\n' + diff
              });
            }
  # Stage 3: LLM Evaluation
  llm-evaluation:
    runs-on: ubuntu-latest
    needs: [code-quality, prompt-validation]
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-eval.txt
      - name: Run evaluation suite
        run: |
          python -m evaluation.run \
            --config eval/config.yaml \
            --output eval_results/
      - name: Check quality gates
        run: |
          python -m evaluation.gates \
            --results eval_results/results.json \
            --min-relevance 0.85 \
            --min-faithfulness 0.90 \
            --max-latency-p95 3000
      - name: Upload evaluation results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval_results/
      - name: Post evaluation summary
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval_results/results.json', 'utf8'));
            const summary = `## LLM Evaluation Results
            | Metric | Score | Threshold | Status |
            |--------|-------|-----------|--------|
            | Relevance | ${results.relevance.toFixed(3)} | 0.85 | ${results.relevance >= 0.85 ? '✅' : '❌'} |
            | Faithfulness | ${results.faithfulness.toFixed(3)} | 0.90 | ${results.faithfulness >= 0.90 ? '✅' : '❌'} |
            | Latency P95 | ${results.latency_p95}ms | 3000ms | ${results.latency_p95 <= 3000 ? '✅' : '❌'} |
            `;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: summary
            });
  # Stage 4: Build
  build:
    runs-on: ubuntu-latest
    needs: llm-evaluation
    permissions:
      contents: read
      packages: write
      id-token: write # required for keyless cosign signing
    outputs:
      # metadata-action emits one tag per line; pass a single, fully qualified tag downstream
      image-tag: ${{ fromJSON(steps.meta.outputs.json).tags[0] }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=sha,prefix=
            type=ref,event=branch
            type=semver,pattern={{version}}
      - name: Build and push
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
      - name: Install cosign
        uses: sigstore/cosign-installer@v3
      - name: Sign image
        run: |
          cosign sign --yes ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}@${{ steps.build.outputs.digest }}
  # Stage 5: Deploy to Staging
  deploy-staging:
    runs-on: ubuntu-latest
    needs: build
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Set up kubectl
        uses: azure/setup-kubectl@v3
      - name: Configure kubeconfig
        run: |
          echo "${{ secrets.KUBE_CONFIG_STAGING }}" | base64 -d > kubeconfig
          # Persist KUBECONFIG for later steps (a plain export would not survive the step boundary)
          echo "KUBECONFIG=$PWD/kubeconfig" >> "$GITHUB_ENV"
      - name: Deploy to staging
        run: |
          kubectl set image deployment/llm-app \
            llm-app=${{ needs.build.outputs.image-tag }} \
            -n staging
          kubectl rollout status deployment/llm-app -n staging
      - name: Run smoke tests
        run: |
          python tests/smoke/run.py --env staging
      - name: Run load test
        run: |
          # Assumes the k6 binary is available on the runner
          k6 run tests/load/staging.js
  # Stage 6: Deploy to Production (manual approval via the environment's protection rules)
  deploy-production:
    runs-on: ubuntu-latest
    needs: [build, deploy-staging]
    if: github.ref == 'refs/heads/main'
    environment:
      name: production
      url: https://api.example.com
    steps:
      - uses: actions/checkout@v4
      - name: Set up kubectl
        uses: azure/setup-kubectl@v3
      - name: Configure kubeconfig
        run: |
          echo "${{ secrets.KUBE_CONFIG_PROD }}" | base64 -d > kubeconfig
          echo "KUBECONFIG=$PWD/kubeconfig" >> "$GITHUB_ENV"
      - name: Canary deployment (10%)
        run: |
          kubectl apply -f k8s/canary.yaml
          sleep 60
      - name: Check canary metrics
        run: |
          python scripts/check_canary.py --threshold 0.99
      - name: Full rollout
        run: |
          kubectl set image deployment/llm-app \
            llm-app=${{ needs.build.outputs.image-tag }} \
            -n production
          kubectl rollout status deployment/llm-app -n production
      - name: Notify team
        uses: slackapi/slack-github-action@v1
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
        with:
          payload: |
            {
              "text": "🚀 LLM App deployed to production",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Deployment Complete*\nImage: `${{ needs.build.outputs.image-tag }}`\nCommit: `${{ github.sha }}`"
                  }
                }
              ]
            }
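The evaluation job shells out to `python -m evaluation.gates`, which I haven't shown. Here is a minimal sketch of what that module can look like; the CLI flags mirror the workflow step above, and the flat results.json layout (relevance, faithfulness, latency_p95) matches what the PR-comment step reads, but everything else is an assumption about your evaluation harness.

# evaluation/gates.py (minimal sketch)
import argparse
import json
import sys

def main() -> int:
    parser = argparse.ArgumentParser(
        description="Fail the pipeline if evaluation metrics miss their thresholds"
    )
    parser.add_argument("--results", required=True, help="Path to results.json from evaluation.run")
    parser.add_argument("--min-relevance", type=float, required=True)
    parser.add_argument("--min-faithfulness", type=float, required=True)
    parser.add_argument("--max-latency-p95", type=float, required=True, help="Milliseconds")
    args = parser.parse_args()

    with open(args.results) as f:
        results = json.load(f)

    # (metric, observed value, passed?)
    checks = [
        ("relevance", results["relevance"], results["relevance"] >= args.min_relevance),
        ("faithfulness", results["faithfulness"], results["faithfulness"] >= args.min_faithfulness),
        ("latency_p95", results["latency_p95"], results["latency_p95"] <= args.max_latency_p95),
    ]
    for name, value, ok in checks:
        print(f"{'PASS' if ok else 'FAIL'} {name}={value}")

    # A non-zero exit code fails the GitHub Actions step and blocks the deploy stages
    return 0 if all(ok for _, _, ok in checks) else 1

if __name__ == "__main__":
    sys.exit(main())

The important part is the exit code: because the build and deploy jobs depend on this one, a failed gate stops the image from ever being built or shipped.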
Kubernetes Deployment Manifests
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-app
  labels:
    app: llm-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-app
  template:
    metadata:
      labels:
        app: llm-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
    spec:
      containers:
        - name: llm-app
          # Placeholder tag; CI pins the image via kubectl set image during deployment
          image: ghcr.io/myorg/llm-app:latest
          ports:
            - containerPort: 8000
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: openai-api-key
            - name: REDIS_URL
              value: "redis://redis:6379"
            - name: QDRANT_URL
              value: "http://qdrant:6333"
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
---
# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: llm-app
spec:
  selector:
    app: llm-app
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
---
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Custom pods metric; requires Prometheus plus a metrics adapter (see note below)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
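One caveat on the HPA above: the http_requests_per_second pods metric doesn't exist out of the box. The deployment's prometheus.io annotations only get the pod scraped; you still need Prometheus plus an adapter such as prometheus-adapter to surface the rate through the custom metrics API. On the application side, exposing the underlying counter is a few lines. A rough sketch with prometheus_client and FastAPI (the metric and label names here are my assumptions, not something the manifests require):

# metrics.py (sketch: expose a request counter on the same port the annotations point at)
from fastapi import FastAPI, Request
from prometheus_client import Counter, make_asgi_app

app = FastAPI()

# Cumulative request count; the adapter derives the per-second rate the HPA consumes
HTTP_REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "method"])

@app.middleware("http")
async def count_requests(request: Request, call_next):
    HTTP_REQUESTS.labels(path=request.url.path, method=request.method).inc()
    return await call_next(request)

# Serve Prometheus metrics at /metrics on port 8000
app.mount("/metrics", make_asgi_app())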
Self-Hosted Model Serving with vLLM
# k8s/vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            # Note: gated Llama checkpoints also require a Hugging Face token (e.g. HF_TOKEN from a secret)
            - "--model"
            - "meta-llama/Llama-3.3-70B-Instruct"
            - "--tensor-parallel-size"
            - "2"
            - "--max-model-len"
            - "32768"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 2
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
      nodeSelector:
        gpu-type: a100
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
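Because the container runs vLLM's OpenAI-compatible server, the application doesn't need a special client: point the standard OpenAI SDK at the in-cluster service. A minimal sketch, assuming a ClusterIP Service named vllm-llama in front of the deployment (the Service manifest isn't shown here):

# vllm_client.py (sketch; service name and prompt are assumptions)
from openai import OpenAI

# Talk to the self-hosted vLLM endpoint instead of api.openai.com
client = OpenAI(
    base_url="http://vllm-llama:8000/v1",
    api_key="not-needed",  # vLLM ignores the key unless the server is started with --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # must match the --model argument above
    messages=[{"role": "user", "content": "Summarize this document in three bullet points."}],
    max_tokens=256,
)
print(response.choices[0].message.content)

This is what makes self-hosting relatively painless: switching between a managed API and your own GPUs is mostly a base_url change.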
Observability with Langfuse
# observability.py
from fastapi import FastAPI, Request
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
from openai import OpenAI

# Initialize Langfuse (keys are placeholders; prefer loading them from environment variables)
langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://langfuse.your-domain.com"  # Self-hosted
)

@observe(as_type="generation")
def call_llm(prompt: str, model: str = "gpt-4o") -> str:
    """Traced LLM call."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    # Attach model and token usage to the current observation
    langfuse_context.update_current_observation(
        model=model,
        usage={
            "input": response.usage.prompt_tokens,
            "output": response.usage.completion_tokens
        }
    )
    return response.choices[0].message.content

@observe(name="retrieval")
def traced_retrieval(query: str) -> list[dict]:
    """Retrieval step, traced as its own span via the decorator."""
    docs = retrieve_documents(query)  # your retrieval function (a sketch follows below)
    langfuse_context.update_current_observation(metadata={"num_docs": len(docs)})
    return docs

@observe()
def rag_pipeline(query: str) -> dict:
    """Traced RAG pipeline: retrieval and generation appear as nested observations."""
    docs = traced_retrieval(query)
    context = "\n".join([d["content"] for d in docs])
    answer = call_llm(f"Context: {context}\n\nQuestion: {query}")
    # Score the trace (set the value from actual user feedback in production)
    langfuse_context.score_current_trace(
        name="user_feedback",
        value=1,
        comment="User marked as helpful"
    )
    return {"answer": answer, "sources": docs}

# Usage in FastAPI
app = FastAPI()

@app.post("/query")
@observe()
async def query_endpoint(request: Request):
    body = await request.json()
    # Set user and session context on the current trace
    langfuse_context.update_current_trace(
        user_id=request.headers.get("X-User-ID"),
        session_id=request.headers.get("X-Session-ID"),
        metadata={"endpoint": "/query"}
    )
    result = rag_pipeline(body["query"])
    return result
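The pipeline above leans on a retrieve_documents helper I didn't show. Here's a minimal sketch against the Qdrant instance wired up in the deployment manifest (QDRANT_URL); the collection name, payload field, and embedding model are assumptions to keep the example self-contained:

# retrieval.py (sketch; collection name "docs" and payload field "content" are assumptions)
import os
from openai import OpenAI
from qdrant_client import QdrantClient

qdrant = QdrantClient(url=os.environ.get("QDRANT_URL", "http://qdrant:6333"))
openai_client = OpenAI()

def retrieve_documents(query: str, top_k: int = 5) -> list[dict]:
    """Embed the query and return the top-k matching documents from Qdrant."""
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    ).data[0].embedding

    hits = qdrant.search(
        collection_name="docs",
        query_vector=embedding,
        limit=top_k,
    )
    return [{"content": hit.payload["content"], "score": hit.score} for hit in hits]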
Cost Comparison: Cloud vs Self-Hosted
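There's no single answer here; it comes down to your volume, model choice, and GPU pricing. The honest way to compare is to plug in your own numbers. A small helper for that back-of-envelope math (every figure is an input; nothing below is a claimed price):

# cost_compare.py (sketch: all prices and volumes come from the command line)
import argparse

def main() -> None:
    p = argparse.ArgumentParser(description="Compare managed-API vs self-hosted inference cost per month")
    p.add_argument("--tokens-per-month", type=float, required=True, help="Total input+output tokens per month")
    p.add_argument("--api-price-per-1m-tokens", type=float, required=True, help="Blended API price per 1M tokens")
    p.add_argument("--gpu-hourly-cost", type=float, required=True, help="Cost of one GPU node per hour")
    p.add_argument("--gpu-count", type=int, required=True, help="GPU nodes running around the clock")
    p.add_argument("--ops-overhead", type=float, default=0.0, help="Monthly engineering/ops overhead")
    args = p.parse_args()

    api_cost = args.tokens_per_month / 1_000_000 * args.api_price_per_1m_tokens
    self_hosted_cost = args.gpu_hourly_cost * args.gpu_count * 24 * 30 + args.ops_overhead  # ~30-day month

    print(f"Managed API: {api_cost:,.0f}/month")
    print(f"Self-hosted: {self_hosted_cost:,.0f}/month")
    if api_cost > 0:
        print(f"Self-hosting saves {(api_cost - self_hosted_cost) / api_cost:.0%}")

if __name__ == "__main__":
    main()

The takeaway below quotes 70-80% savings at scale; that only materializes when the GPUs stay busy. At low or spiky volume, the managed APIs usually win, which is why I recommend starting there.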
Key Takeaways
- GitHub Actions: Full LLMOps pipeline with prompt validation, evaluation gates, and staged deployment
- Kubernetes: Production-ready manifests with autoscaling, health checks, and GPU support
- Self-hosted models: vLLM + Kubernetes for 70-80% cost savings at scale
- Observability: Langfuse (self-hosted) for complete LLM tracing
- GitOps: ArgoCD/Flux for declarative, auditable deployments
Series Conclusion
Over these 9 parts, we’ve covered the complete GenAI stack—from fundamentals to enterprise deployment. The technology is moving fast, but the patterns we’ve discussed are foundational.
Build iteratively. Start with managed services, then optimize with self-hosted as you scale. Invest in observability from day one. And always remember: the goal isn’t to use AI—it’s to solve problems.
Good luck building the future.
References & Further Reading
- vLLM – vllm.ai – High-throughput LLM serving
- Langfuse – langfuse.com – Open source LLM observability
- Qdrant – qdrant.tech – Vector database
- MLflow – mlflow.org – ML lifecycle management
- ArgoCD – argo-cd.readthedocs.io – GitOps for Kubernetes
- NVIDIA Triton – nvidia.com – Inference server
Thanks for following this series! Share your LLMOps setup on GitHub or connect on LinkedIn.