Large Language Models – C4: Container, Code, Cloud & Context

Streaming Responses for LLMs: Implementing Server-Sent Events

Posted on June 15, 2025 by Nithin Mohan TK 10 min read

Streaming LLM responses dramatically improves user experience. After implementing streaming for 20+ LLM applications, I’ve learned what works. Here’s the complete guide to implementing Server-Sent Events for LLM streaming. Figure 1: Streaming Architecture Why Streaming Matters Streaming LLM responses provides significant benefits: Perceived performance: Users see results immediately, not after 10+ seconds Better UX: Progressive […]

Read more →

Quantization Methods for LLMs: GPTQ, AWQ, and BitsAndBytes

Posted on April 8, 2025 by Nithin Mohan TK 5 min read

Last year, I needed to run a 13B parameter model on a 16GB GPU. Full precision required 52GB. After testing GPTQ, AWQ, and BitsAndBytes, I reduced memory to 7GB with minimal accuracy loss. After quantizing 30+ models, I’ve learned which method works best for each scenario. Here’s the complete guide to LLM quantization. Figure 1: […]

Read more →

Running LLMs on Kubernetes: Production Deployment Guide

Posted on February 20, 2025 by Nithin Mohan TK 7 min read

Deploying LLMs on Kubernetes requires careful planning. After deploying 25+ LLM models on Kubernetes, I’ve learned what works. Here’s the complete guide to running LLMs on Kubernetes in production. Figure 1: Kubernetes LLM Architecture Why Kubernetes for LLMs Kubernetes offers significant advantages for LLM deployment: Scalability: Auto-scale based on demand Resource management: Efficient GPU and […]

Read more →

Google Gemini API: Building Multimodal AI Applications with 2M Token Context

Posted on December 8, 2024 by Nithin Mohan TK 7 min read

Introduction: Google’s Gemini API represents a significant leap in multimodal AI capabilities. Launched in December 2023, Gemini models are natively multimodal, trained from the ground up to understand and generate text, images, audio, and video. With context windows up to 2 million tokens and native Google Search grounding, Gemini offers unique capabilities for building sophisticated […]

Read more →

Serverless AI Architecture: Building Scalable LLM Applications

Posted on December 5, 2024 by Nithin Mohan TK 6 min read

Three years ago, I built my first serverless LLM application. It failed spectacularly. Cold starts made responses take 15 seconds. Timeouts killed long-running requests. Costs spiraled out of control. After architecting 30+ serverless AI systems, I’ve learned what works. Here’s the complete guide to building scalable serverless LLM applications. Figure 1: Serverless AI Architecture Overview […]

Read more →

Deploying LLM Applications on Cloud Run: A Complete Guide

Posted on November 5, 2024 by Nithin Mohan TK 6 min read

Last year, I deployed our first LLM application to Cloud Run. What should have taken hours took three days. Cold starts killed our latency. Memory limits caused crashes. Timeouts broke long-running requests. After deploying 20+ LLM applications to Cloud Run, I’ve learned what works and what doesn’t. Here’s the complete guide. Figure 1: Cloud Run […]

Read more →

Searching in

Tag: Large Language Models

Streaming Responses for LLMs: Implementing Server-Sent Events

Quantization Methods for LLMs: GPTQ, AWQ, and BitsAndBytes

Running LLMs on Kubernetes: Production Deployment Guide

Google Gemini API: Building Multimodal AI Applications with 2M Token Context