Introduction
Microsoft Foundry Local brings the power of Azure AI Foundry directly to your local device, enabling you to run state-of-the-art AI models without cloud dependencies. Announced at Microsoft Build 2025 and continuously enhanced since, Foundry Local marks a significant shift in how developers can build AI-powered applications: complete data privacy, zero API costs, and offline capability. Whether you’re building on Windows, macOS, Linux, or even Android, Foundry Local provides a unified SDK that abstracts hardware complexity and delivers consistent performance across platforms. This guide covers everything from installation to building production-ready local AI applications.

What is Foundry Local?
Foundry Local is Microsoft’s on-device AI runtime that enables developers to run large language models (LLMs) and other AI models locally on consumer hardware. It’s designed to work seamlessly with the broader Azure AI Foundry ecosystem, allowing you to develop locally and deploy to the cloud when needed—or keep everything on-device for maximum privacy and control.
Capabilities and Features
Foundry Local provides comprehensive capabilities for local AI development:
- Cross-Platform Support: Windows 10/11, macOS (Intel and Apple Silicon), Linux, and Android (preview)
- Hardware Acceleration: Automatic detection and optimization for CPU, NVIDIA GPU (CUDA), AMD GPU (DirectML), and NPU
- Model Catalog: Pre-optimized models including Phi-4, Phi-3.5, Llama 3.2, Mistral, and more
- OpenAI-Compatible API: Drop-in replacement for OpenAI API calls—switch between local and cloud with minimal code changes
- Automatic Model Management: Download, cache, and update models automatically
- Memory Optimization: Intelligent memory management for running large models on consumer hardware
- Multi-Modal Support: Text, vision, and speech capabilities with supported models
- Offline Operation: Full functionality without internet connectivity after initial model download
- VS Code Integration: AI Toolkit extension for seamless development experience
- ONNX Runtime Backend: Optimized inference engine for maximum performance
Getting Started
Install Foundry Local and run your first local AI model:
```bash
# Install Foundry Local CLI

# Windows (PowerShell)
winget install Microsoft.FoundryLocal

# macOS (Homebrew)
brew install microsoft/foundry/foundry-local

# Linux (install script)
curl -fsSL https://aka.ms/foundry-local-install | bash

# Verify installation
foundry --version

# List available models
foundry model list

# Download a model (Phi-4 recommended for getting started)
foundry model download phi-4

# Start the local inference server
foundry service start

# The server runs on http://localhost:5272 by default
```
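Once the service is running, you can quickly confirm that the OpenAI-compatible endpoint is reachable. Here is a minimal sketch assuming the default port 5272 used throughout this guide; the `/v1/models` route follows the OpenAI REST convention.

```python
import requests

# Sanity check against the local Foundry endpoint (assumes the default port 5272).
# The /v1/models route follows the OpenAI REST convention and lists available models.
resp = requests.get("http://localhost:5272/v1/models", timeout=5)
resp.raise_for_status()

for model in resp.json().get("data", []):
    print(model.get("id"))
```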
Using the Python SDK
The Python SDK provides a simple, OpenAI-compatible interface:
```bash
# Install the SDK
pip install foundry-local-sdk
```

```python
# Basic chat completion
from foundry_local import FoundryLocalClient

# Initialize client (connects to local server)
client = FoundryLocalClient()

# Simple chat completion
response = client.chat.completions.create(
    model="phi-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)

# Streaming response for real-time output
stream = client.chat.completions.create(
    model="phi-4",
    messages=[
        {"role": "user", "content": "Write a short poem about coding."}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
OpenAI API Compatibility
Foundry Local is fully compatible with the OpenAI Python SDK, making migration seamless:
```python
from openai import OpenAI

# Point OpenAI client to local Foundry server
client = OpenAI(
    base_url="http://localhost:5272/v1",
    api_key="not-needed"  # No API key required for local
)

# Use exactly like OpenAI API
response = client.chat.completions.create(
    model="phi-4",
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}
    ]
)

print(response.choices[0].message.content)

# Easy switching between local and cloud
import os

def get_ai_client():
    """Return local or cloud client based on environment."""
    if os.getenv("USE_LOCAL_AI", "true").lower() == "true":
        return OpenAI(
            base_url="http://localhost:5272/v1",
            api_key="not-needed"
        )
    else:
        return OpenAI()  # Uses OPENAI_API_KEY env var

client = get_ai_client()
```
Working with Different Models
```python
from foundry_local import FoundryLocalClient

client = FoundryLocalClient()

# List downloaded models
models = client.models.list()
for model in models.data:
    print(f"Model: {model.id}, Size: {model.size_gb}GB")

# Download a new model programmatically
client.models.download("llama-3.2-3b")

# Use specific models for different tasks

# Phi-4 for general reasoning
reasoning_response = client.chat.completions.create(
    model="phi-4",
    messages=[{"role": "user", "content": "Solve: If x + 5 = 12, what is x?"}]
)

# Llama for creative writing
creative_response = client.chat.completions.create(
    model="llama-3.2-3b",
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
    temperature=0.9
)

# Phi-3.5-vision for image understanding (if available)
vision_response = client.chat.completions.create(
    model="phi-3.5-vision",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}}
            ]
        }
    ]
)
```
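Building on the calls above, a small routing helper keeps per-task model choices in one place. This is a sketch reusing the `FoundryLocalClient` interface from the examples above; the task labels and model mapping are illustrative, not part of the SDK.

```python
from foundry_local import FoundryLocalClient

# Illustrative mapping of task types to local models (names match the examples above).
TASK_MODELS = {
    "reasoning": "phi-4",
    "creative": "llama-3.2-3b",
}

client = FoundryLocalClient()

def ask(task: str, prompt: str, temperature: float = 0.7) -> str:
    """Route a prompt to the model configured for the given task type."""
    model = TASK_MODELS.get(task, "phi-4")  # Fall back to the default model
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

print(ask("reasoning", "Solve: If x + 5 = 12, what is x?"))
```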
Building a Local RAG Application
```python
from foundry_local import FoundryLocalClient
from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize clients
llm_client = FoundryLocalClient()
embedder = SentenceTransformer('all-MiniLM-L6-v2')  # Local embeddings

# Sample knowledge base
documents = [
    "Foundry Local runs AI models on your device without cloud dependencies.",
    "Phi-4 is a 14 billion parameter model optimized for reasoning tasks.",
    "The OpenAI-compatible API allows easy migration from cloud to local.",
    "Hardware acceleration supports NVIDIA CUDA, AMD DirectML, and NPUs.",
    "Models are cached locally after first download for offline use."
]

# Create embeddings for documents
doc_embeddings = embedder.encode(documents)

def retrieve_context(query: str, top_k: int = 3) -> str:
    """Retrieve relevant documents for a query."""
    query_embedding = embedder.encode([query])[0]
    # Calculate cosine similarity
    similarities = np.dot(doc_embeddings, query_embedding) / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
    )
    # Get top-k documents
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return "\n".join([documents[i] for i in top_indices])

def rag_query(question: str) -> str:
    """Answer questions using RAG with local models."""
    context = retrieve_context(question)
    response = llm_client.chat.completions.create(
        model="phi-4",
        messages=[
            {
                "role": "system",
                "content": f"Answer based on this context:\n{context}\n\nBe concise and accurate."
            },
            {"role": "user", "content": question}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content

# Example usage
answer = rag_query("What hardware does Foundry Local support?")
print(answer)
```
CLI Commands Reference
```bash
# Model Management
foundry model list                    # List available models
foundry model list --downloaded       # List downloaded models
foundry model download phi-4          # Download a model
foundry model delete llama-3.2-1b     # Delete a model
foundry model info phi-4              # Show model details

# Service Management
foundry service start                 # Start inference server
foundry service start --port 8080     # Custom port
foundry service start --gpu           # Force GPU acceleration
foundry service stop                  # Stop the server
foundry service status                # Check server status

# Interactive Chat
foundry chat phi-4                    # Start interactive chat
foundry chat phi-4 --system "You are a Python expert"

# Configuration
foundry config set cache-dir /path/to/cache
foundry config set default-model phi-4
foundry config show

# Benchmarking
foundry benchmark phi-4               # Run performance benchmark
foundry benchmark phi-4 --prompt "Hello world"
```
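For scripted setups (provisioning scripts, CI images), the same CLI commands can be driven from Python. Here is a minimal sketch using `subprocess`, assuming the `foundry` CLI shown above is on the PATH; the model name is simply the one used throughout this guide.

```python
import subprocess

def foundry(*args: str) -> str:
    """Run a foundry CLI command (as listed above) and return its output."""
    result = subprocess.run(
        ["foundry", *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Example: download the model used in this guide, then check the service status.
print(foundry("model", "download", "phi-4"))
print(foundry("service", "status"))
```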
Benefits of Local AI
Privacy and Data Security: Your data never leaves your device. This is critical for healthcare, finance, legal, and any application handling sensitive information. No data is sent to external servers, ensuring complete confidentiality.
Cost Efficiency: Zero API costs after initial setup. No per-token charges, no rate limits, no monthly bills. For high-volume applications, the savings can be substantial—running thousands of inferences daily costs nothing beyond electricity.
Low Latency: No network round-trips mean faster responses. Local inference typically achieves 50-200ms latency compared to 500ms-2s for cloud APIs. This enables real-time applications like live coding assistants and interactive chatbots.
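If you want to verify these numbers on your own hardware, a quick timing loop against the local endpoint is enough. Below is a sketch using the OpenAI-compatible client from earlier, assuming the default port and the phi-4 model; results will vary with hardware, prompt length, and whether the model is already loaded.

```python
import time
from openai import OpenAI

# Measure end-to-end latency of short local completions (assumes the defaults used earlier).
client = OpenAI(base_url="http://localhost:5272/v1", api_key="not-needed")

latencies = []
for _ in range(5):
    start = time.perf_counter()
    client.chat.completions.create(
        model="phi-4",
        messages=[{"role": "user", "content": "Say hello in one word."}],
        max_tokens=5,
    )
    latencies.append((time.perf_counter() - start) * 1000)

print(f"median latency: {sorted(latencies)[len(latencies) // 2]:.0f} ms")
```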
Offline Capability: Full functionality without internet connectivity. Perfect for field applications, air-gapped environments, or simply working on a plane. Once models are downloaded, everything works offline.
Predictable Performance: No variability from shared cloud infrastructure. Your local hardware delivers consistent performance without the unpredictability of cloud load balancing.
Hardware Requirements
| Model | Min RAM | Recommended RAM | Storage | GPU (Optional) |
|---|---|---|---|---|
| Phi-4 (14B) | 8GB | 16GB | ~8GB | 4GB+ VRAM |
| Phi-3.5-mini (3.8B) | 8GB | 16GB | ~7GB | 4GB+ VRAM |
| Llama 3.2 1B | 4GB | 8GB | ~2GB | 2GB+ VRAM |
| Llama 3.2 3B | 8GB | 16GB | ~6GB | 4GB+ VRAM |
| Mistral 7B | 16GB | 32GB | ~14GB | 8GB+ VRAM |
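Before downloading a larger model, it can help to check that the machine actually meets the RAM figures in the table. Here is a minimal sketch using the third-party `psutil` package; the thresholds are copied from the table above and the model keys are only illustrative labels.

```python
import psutil

# Minimum RAM in GB, taken from the table above (illustrative keys).
MIN_RAM_GB = {
    "phi-4": 8,
    "phi-3.5-mini": 8,
    "llama-3.2-1b": 4,
    "llama-3.2-3b": 8,
    "mistral-7b": 16,
}

def meets_minimum(model: str) -> bool:
    """Compare total system RAM against the table's minimum for a model."""
    total_gb = psutil.virtual_memory().total / (1024 ** 3)
    required = MIN_RAM_GB.get(model, 8)
    print(f"{model}: {total_gb:.1f}GB installed, {required}GB minimum")
    return total_gb >= required

meets_minimum("mistral-7b")
```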
When to Use Foundry Local
Ideal use cases:
- Applications handling sensitive or regulated data (HIPAA, GDPR, financial)
- Offline or air-gapped environments
- High-volume inference where API costs would be prohibitive
- Development and prototyping without API key management
- Edge deployments and embedded systems
- Real-time applications requiring low latency
- Educational environments teaching AI without cloud dependencies
Consider cloud alternatives when:
- You need the largest models (e.g., GPT-4, Claude Opus) that don’t fit locally
- Hardware constraints prevent acceptable performance
- You need guaranteed uptime and enterprise SLAs
- Multi-region deployment is required
References and Documentation
- Official Documentation: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/
- Getting Started Guide: https://learn.microsoft.com/en-us/windows/ai/foundry-local/get-started
- GitHub Samples: https://aka.ms/foundrylocalsamples
- Foundry Local SDK: https://aka.ms/foundrylocalSDK
- Microsoft Foundry Blog: https://devblogs.microsoft.com/foundry/
Conclusion
Microsoft Foundry Local democratizes access to powerful AI models by bringing them directly to your device. The combination of cross-platform support, OpenAI API compatibility, and automatic hardware optimization makes it remarkably easy to build AI applications that run entirely locally. For developers concerned about data privacy, API costs, or offline capability, Foundry Local provides a compelling alternative to cloud-based AI services. As models continue to become more efficient and hardware more capable, local AI inference will only become more practical—and Foundry Local positions you to take advantage of this trend today.