Introduction
Microsoft Foundry Local brings the power of Azure AI Foundry directly to your local device, enabling you to run state-of-the-art AI models without cloud dependencies. Announced at Microsoft Build 2025 and continuously enhanced since, Foundry Local marks a significant shift in how developers can build AI-powered applications: complete data privacy, zero API costs, and offline capability. Whether you’re building on Windows, macOS, Linux, or even Android, Foundry Local provides a unified SDK that abstracts hardware complexity and delivers consistent performance across platforms. This guide covers everything from installation to building production-ready local AI applications.

What is Foundry Local?
Foundry Local is Microsoft’s on-device AI runtime that enables developers to run large language models (LLMs) and other AI models locally on consumer hardware. It’s designed to work seamlessly with the broader Azure AI Foundry ecosystem, allowing you to develop locally and deploy to the cloud when needed—or keep everything on-device for maximum privacy and control.
Capabilities and Features
Foundry Local provides comprehensive capabilities for local AI development:
- Cross-Platform Support: Windows 10/11, macOS (Intel and Apple Silicon), Linux, and Android (preview)
- Hardware Acceleration: Automatic detection and optimization for CPU, NVIDIA GPU (CUDA), AMD GPU (DirectML), and NPU
- Model Catalog: Pre-optimized models including Phi-4, Phi-3.5, Llama 3.2, Mistral, and more
- OpenAI-Compatible API: Drop-in replacement for OpenAI API calls—switch between local and cloud with minimal code changes
- Automatic Model Management: Download, cache, and update models automatically
- Memory Optimization: Intelligent memory management for running large models on consumer hardware
- Multi-Modal Support: Text, vision, and speech capabilities with supported models
- Offline Operation: Full functionality without internet connectivity after initial model download
- VS Code Integration: AI Toolkit extension for seamless development experience
- ONNX Runtime Backend: Optimized inference engine for maximum performance
Getting Started
Install Foundry Local and run your first local AI model:
```bash
# Install Foundry Local CLI

# Windows (PowerShell)
winget install Microsoft.FoundryLocal

# macOS (Homebrew)
brew install microsoft/foundry/foundry-local

# Linux (install script)
curl -fsSL https://aka.ms/foundry-local-install | bash

# Verify installation
foundry --version

# List available models
foundry model list

# Download a model (Phi-4 recommended for getting started)
foundry model download phi-4

# Start the local inference server
foundry service start

# The server runs on http://localhost:5272 by default
```
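Once the service is running, you can quickly confirm that the OpenAI-compatible endpoint is reachable. Here is a minimal sketch assuming the default port 5272 used throughout this guide; the `/v1/models` route follows the OpenAI REST convention.

```python
import requests

# Sanity check against the local Foundry endpoint (assumes the default port 5272).
# The /v1/models route follows the OpenAI REST convention and lists available models.
resp = requests.get("http://localhost:5272/v1/models", timeout=5)
resp.raise_for_status()

for model in resp.json().get("data", []):
    print(model.get("id"))
```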
Using the Python SDK
The Python SDK provides a simple, OpenAI-compatible interface:
```bash
# Install the SDK
pip install foundry-local-sdk
```

```python
# Basic chat completion
from foundry_local import FoundryLocalClient

# Initialize client (connects to local server)
client = FoundryLocalClient()

# Simple chat completion
response = client.chat.completions.create(
    model="phi-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

# Streaming response for real-time output
stream = client.chat.completions.create(
    model="phi-4",
    messages=[
        {"role": "user", "content": "Write a short poem about coding."}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
OpenAI API Compatibility
Foundry Local is fully compatible with the OpenAI Python SDK, making migration seamless:
```python
from openai import OpenAI

# Point OpenAI client to local Foundry server
client = OpenAI(
    base_url="http://localhost:5272/v1",
    api_key="not-needed"  # No API key required for local
)

# Use exactly like OpenAI API
response = client.chat.completions.create(
    model="phi-4",
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}
    ]
)

print(response.choices[0].message.content)

# Easy switching between local and cloud
import os

def get_ai_client():
    """Return local or cloud client based on environment."""
    if os.getenv("USE_LOCAL_AI", "true").lower() == "true":
        return OpenAI(
            base_url="http://localhost:5272/v1",
            api_key="not-needed"
        )
    else:
        return OpenAI()  # Uses OPENAI_API_KEY env var

client = get_ai_client()
```
Working with Different Models
```python
from foundry_local import FoundryLocalClient

client = FoundryLocalClient()

# List downloaded models
models = client.models.list()
for model in models.data:
    print(f"Model: {model.id}, Size: {model.size_gb}GB")

# Download a new model programmatically
client.models.download("llama-3.2-3b")

# Use specific models for different tasks

# Phi-4 for general reasoning
reasoning_response = client.chat.completions.create(
    model="phi-4",
    messages=[{"role": "user", "content": "Solve: If x + 5 = 12, what is x?"}]
)

# Llama for creative writing
creative_response = client.chat.completions.create(
    model="llama-3.2-3b",
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
    temperature=0.9
)

# Phi-3.5-vision for image understanding (if available)
vision_response = client.chat.completions.create(
    model="phi-3.5-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}}
            ]
        }
    ]
)
```
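Building on the calls above, a small routing helper keeps per-task model choices in one place. This is a sketch reusing the `FoundryLocalClient` interface from the examples above; the task labels and model mapping are illustrative, not part of the SDK.

```python
from foundry_local import FoundryLocalClient

# Illustrative mapping of task types to local models (names match the examples above).
TASK_MODELS = {
    "reasoning": "phi-4",
    "creative": "llama-3.2-3b",
}

client = FoundryLocalClient()

def ask(task: str, prompt: str, temperature: float = 0.7) -> str:
    """Route a prompt to the model configured for the given task type."""
    model = TASK_MODELS.get(task, "phi-4")  # Fall back to the default model
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

print(ask("reasoning", "Solve: If x + 5 = 12, what is x?"))
```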
Building a Local RAG Application
```python
from foundry_local import FoundryLocalClient
from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize clients
llm_client = FoundryLocalClient()
embedder = SentenceTransformer('all-MiniLM-L6-v2')  # Local embeddings

# Sample knowledge base
documents = [
    "Foundry Local runs AI models on your device without cloud dependencies.",
    "Phi-4 is a 14 billion parameter model optimized for reasoning tasks.",
    "The OpenAI-compatible API allows easy migration from cloud to local.",
    "Hardware acceleration supports NVIDIA CUDA, AMD DirectML, and NPUs.",
    "Models are cached locally after first download for offline use."
]

# Create embeddings for documents
doc_embeddings = embedder.encode(documents)

def retrieve_context(query: str, top_k: int = 3) -> str:
    """Retrieve relevant documents for a query."""
    query_embedding = embedder.encode([query])[0]
    # Calculate cosine similarity
    similarities = np.dot(doc_embeddings, query_embedding) / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
    )
    # Get top-k documents
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return "\n".join([documents[i] for i in top_indices])

def rag_query(question: str) -> str:
    """Answer questions using RAG with local models."""
    context = retrieve_context(question)
    response = llm_client.chat.completions.create(
        model="phi-4",
        messages=[
            {
                "role": "system",
                "content": f"Answer based on this context:\n{context}\n\nBe concise and accurate."
            },
            {"role": "user", "content": question}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content

# Example usage
answer = rag_query("What hardware does Foundry Local support?")
print(answer)
```
CLI Commands Reference
```bash
# Model Management
foundry model list                    # List available models
foundry model list --downloaded       # List downloaded models
foundry model download phi-4          # Download a model
foundry model delete llama-3.2-1b     # Delete a model
foundry model info phi-4              # Show model details

# Service Management
foundry service start                 # Start inference server
foundry service start --port 8080     # Custom port
foundry service start --gpu           # Force GPU acceleration
foundry service stop                  # Stop the server
foundry service status                # Check server status

# Interactive Chat
foundry chat phi-4                    # Start interactive chat
foundry chat phi-4 --system "You are a Python expert"

# Configuration
foundry config set cache-dir /path/to/cache
foundry config set default-model phi-4
foundry config show

# Benchmarking
foundry benchmark phi-4               # Run performance benchmark
foundry benchmark phi-4 --prompt "Hello world"
```
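For scripted setups (provisioning scripts, CI images), the same CLI commands can be driven from Python. Here is a minimal sketch using `subprocess`, assuming the `foundry` CLI shown above is on the PATH; the model name is simply the one used throughout this guide.

```python
import subprocess

def foundry(*args: str) -> str:
    """Run a foundry CLI command (as listed above) and return its output."""
    result = subprocess.run(
        ["foundry", *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Example: download the model used in this guide, then check the service status.
print(foundry("model", "download", "phi-4"))
print(foundry("service", "status"))
```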
Benefits of Local AI
Privacy and Data Security: Your data never leaves your device. This is critical for healthcare, finance, legal, and any application handling sensitive information. No data is sent to external servers, ensuring complete confidentiality.
Cost Efficiency: Zero API costs after initial setup. No per-token charges, no rate limits, no monthly bills. For high-volume applications, the savings can be substantial—running thousands of inferences daily costs nothing beyond electricity.
Low Latency: No network round-trips mean faster responses. Local inference typically achieves 50-200ms latency compared to 500ms-2s for cloud APIs. This enables real-time applications like live coding assistants and interactive chatbots.
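If you want to verify these numbers on your own hardware, a quick timing loop against the local endpoint is enough. Below is a sketch using the OpenAI-compatible client from earlier, assuming the default port and the phi-4 model; results will vary with hardware, prompt length, and whether the model is already loaded.

```python
import time
from openai import OpenAI

# Measure end-to-end latency of short local completions (assumes the defaults used earlier).
client = OpenAI(base_url="http://localhost:5272/v1", api_key="not-needed")

latencies = []
for _ in range(5):
    start = time.perf_counter()
    client.chat.completions.create(
        model="phi-4",
        messages=[{"role": "user", "content": "Say hello in one word."}],
        max_tokens=5,
    )
    latencies.append((time.perf_counter() - start) * 1000)

print(f"median latency: {sorted(latencies)[len(latencies) // 2]:.0f} ms")
```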
Offline Capability: Full functionality without internet connectivity. Perfect for field applications, air-gapped environments, or simply working on a plane. Once models are downloaded, everything works offline.
Predictable Performance: No variability from shared cloud infrastructure. Your local hardware delivers consistent performance without the unpredictability of cloud load balancing.
Hardware Requirements
| Model | Min RAM | Recommended RAM | Storage | GPU (Optional) |
|---|---|---|---|---|
| Phi-4 (14B) | 8GB | 16GB | ~8GB | 4GB+ VRAM |
| Phi-3.5-mini (3.8B) | 8GB | 16GB | ~7GB | 4GB+ VRAM |
| Llama 3.2 1B | 4GB | 8GB | ~2GB | 2GB+ VRAM |
| Llama 3.2 3B | 8GB | 16GB | ~6GB | 4GB+ VRAM |
| Mistral 7B | 16GB | 32GB | ~14GB | 8GB+ VRAM |
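Before downloading a larger model, it can help to check that the machine actually meets the RAM figures in the table. Here is a minimal sketch using the third-party `psutil` package; the thresholds are copied from the table above and the model keys are only illustrative labels.

```python
import psutil

# Minimum RAM in GB, taken from the table above (illustrative keys).
MIN_RAM_GB = {
    "phi-4": 8,
    "phi-3.5-mini": 8,
    "llama-3.2-1b": 4,
    "llama-3.2-3b": 8,
    "mistral-7b": 16,
}

def meets_minimum(model: str) -> bool:
    """Compare total system RAM against the table's minimum for a model."""
    total_gb = psutil.virtual_memory().total / (1024 ** 3)
    required = MIN_RAM_GB.get(model, 8)
    print(f"{model}: {total_gb:.1f}GB installed, {required}GB minimum")
    return total_gb >= required

meets_minimum("mistral-7b")
```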
When to Use Foundry Local
Ideal use cases:
- Applications handling sensitive or regulated data (HIPAA, GDPR, financial)
- Offline or air-gapped environments
- High-volume inference where API costs would be prohibitive
- Development and prototyping without API key management
- Edge deployments and embedded systems
- Real-time applications requiring low latency
- Educational environments teaching AI without cloud dependencies
Consider cloud alternatives when:
- You need the largest models (e.g., GPT-4, Claude Opus) that don’t fit locally
- Hardware constraints prevent acceptable performance
- You need guaranteed uptime and enterprise SLAs
- Multi-region deployment is required
References and Documentation
- Official Documentation: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/
- Getting Started Guide: https://learn.microsoft.com/en-us/windows/ai/foundry-local/get-started
- GitHub Samples: https://aka.ms/foundrylocalsamples
- Foundry Local SDK: https://aka.ms/foundrylocalSDK
- Microsoft Foundry Blog: https://devblogs.microsoft.com/foundry/
Conclusion
Microsoft Foundry Local democratizes access to powerful AI models by bringing them directly to your device. The combination of cross-platform support, OpenAI API compatibility, and automatic hardware optimization makes it remarkably easy to build AI applications that run entirely locally. For developers concerned about data privacy, API costs, or offline capability, Foundry Local provides a compelling alternative to cloud-based AI services. As models continue to become more efficient and hardware more capable, local AI inference will only become more practical—and Foundry Local positions you to take advantage of this trend today.