AI Models – C4: Container, Code, Cloud & Context

Getting Started with Microsoft Foundry Local: Run AI Models On-Device Without the Cloud

Posted on December 20, 2025 by Nithin Mohan TK · 8 min read

Microsoft Foundry Local brings Azure AI Foundry directly to your local device, letting you run state-of-the-art AI models without cloud dependencies. Announced at Microsoft Build 2025 and enhanced continuously since, Foundry Local marks a significant shift in how developers can build AI-powered applications, with complete data privacy, zero API costs, and offline […]

The Evolution of Anthropic Claude: From 3.5 to 4.5 Opus – A Technical Deep Dive

Posted on December 18, 2025 by Nithin Mohan TK · 7 min read

Having worked with AI models for over two decades, I've witnessed countless technological shifts, but few have been as remarkable as the evolution of Anthropic's Claude. From the initial Claude 1.0 release in March 2023 to the groundbreaking Claude 4.5 Opus in late 2025, Anthropic has consistently pushed the boundaries of what's possible with large language models. […]

Streaming Responses for LLMs: Implementing Server-Sent Events

Posted on June 15, 2025 by Nithin Mohan TK · 10 min read

Streaming LLM responses dramatically improves user experience. After implementing streaming for 20+ LLM applications, I've learned what works. Here's the complete guide to implementing Server-Sent Events for LLM streaming.

Figure 1: Streaming Architecture

Why Streaming Matters

Streaming LLM responses provides significant benefits:

Perceived performance: Users see results immediately, not after 10+ seconds.
Better UX: Progressive […]
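
Server-Sent Events is a plain-text protocol: the server answers with the `text/event-stream` content type and writes each event as `field: value` lines, ending every event with an empty line. The sketch below shows only that framing, independent of any web framework. It assumes tokens arrive from some iterable; the helper names, the `{"token": ...}` JSON payload, and the `done` / `[DONE]` end-of-stream sentinel are hypothetical conventions chosen for illustration, not part of the SSE specification or of the post's implementation.

```python
import json
from typing import Iterable, Iterator, Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Encode one Server-Sent Events frame.

    Each line of the payload gets its own `data:` field, and an empty
    line terminates the event.
    """
    lines = []
    if event is not None:
        # Optional named event; clients without a listener for it ignore it
        lines.append(f"event: {event}")
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"


def stream_tokens_as_sse(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap a (hypothetical) stream of model tokens as SSE frames."""
    for token in tokens:
        # JSON-encode so a token containing a newline stays on one data: line
        yield sse_event(json.dumps({"token": token}))
    # Hypothetical sentinel telling the client the stream is complete
    yield sse_event("[DONE]", event="done")


if __name__ == "__main__":
    fake_tokens = ["Hello", ",", " world"]  # stand-in for real model output
    for frame in stream_tokens_as_sse(fake_tokens):
        print(frame, end="")
```

In a web framework you would return this generator as a streaming response with the `text/event-stream` media type. In the browser, `EventSource` delivers unnamed events to `onmessage`, while the named `done` event needs its own `addEventListener("done", ...)` handler.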

Quantization Methods for LLMs: GPTQ, AWQ, and BitsAndBytes

Posted on April 8, 2025 by Nithin Mohan TK · 5 min read

Last year, I needed to run a 13B-parameter model on a 16GB GPU. Full precision (FP32, 4 bytes per parameter) required 52GB. After testing GPTQ, AWQ, and BitsAndBytes, I reduced memory to 7GB with minimal accuracy loss. Having quantized 30+ models since, I've learned which method works best for each scenario. Here's the complete guide to LLM quantization.

Figure 1: […]
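
The 52GB and 7GB figures follow from simple arithmetic on how many bits each weight takes. A minimal sketch, assuming "full precision" means FP32 and 1 GB = 10^9 bytes, and counting weights only: activations, the KV cache, and the per-group scales that GPTQ/AWQ-style methods store are ignored. `weight_memory_gb` is a hypothetical helper, not part of any quantization library.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Weight-only storage in GB (1 GB = 1e9 bytes).

    Ignores activations, KV cache, and quantization metadata
    (scales/zero-points), so real usage is somewhat higher.
    """
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 13e9        # 13B-parameter model from the excerpt
GPU_MEMORY_GB = 16   # the 16GB GPU from the excerpt

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    gb = weight_memory_gb(PARAMS, bits)
    # Weights fitting is necessary but not sufficient: runtime needs headroom
    verdict = "weights fit" if gb < GPU_MEMORY_GB else "weights do not fit"
    print(f"{label:>5}: {gb:5.1f} GB ({verdict} in {GPU_MEMORY_GB}GB)")
```

Under these assumptions FP32 gives 52.0 GB, matching the figure above, FP16 still needs 26.0 GB, and 4-bit weights need 6.5 GB; the gap to the reported 7GB is plausibly the overhead this sketch leaves out.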

Running LLMs on Kubernetes: Production Deployment Guide

Posted on February 20, 2025 by Nithin Mohan TK · 7 min read

Deploying LLMs on Kubernetes requires careful planning. After deploying 25+ LLMs on Kubernetes, I've learned what works. Here's the complete guide to running LLMs on Kubernetes in production.

Figure 1: Kubernetes LLM Architecture

Why Kubernetes for LLMs

Kubernetes offers significant advantages for LLM deployment:

Scalability: Auto-scale based on demand.
Resource management: Efficient GPU and […]

Serverless AI Architecture: Building Scalable LLM Applications

Posted on December 5, 2024 by Nithin Mohan TK · 6 min read

Three years ago, I built my first serverless LLM application. It failed spectacularly: cold starts pushed response times to 15 seconds, timeouts killed long-running requests, and costs spiraled out of control. After architecting 30+ serverless AI systems, I've learned what works. Here's the complete guide to building scalable serverless LLM applications.

Figure 1: Serverless AI Architecture Overview […]