June 2024 « Code, Cloud & Context

Multi-Modal AI: Building Applications with Vision, Audio, and Text

Introduction: Multi-modal AI combines text, image, audio, and video understanding in a single model. GPT-4V, Claude 3, and Gemini can analyze images, extract text from screenshots, read charts, and reason about visual content. This guide covers building multi-modal applications: image analysis and description, document understanding with vision, combining OCR with LLM reasoning, audio transcription and analysis, and applications that handle several input types in one pipeline. These patterns enable use cases that text-only models cannot handle directly, such as extracting structured data from a scanned invoice or searching across meeting recordings.

Image Analysis with GPT-4V

from openai import OpenAI
import base64
from pathlib import Path

client = OpenAI()

def encode_image(image_path: str) -> str:
    """Encode image to base64."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def analyze_image(
    image_path: str,
    prompt: str = "Describe this image in detail.",
    model: str = "gpt-4o"
) -> str:
    """Analyze a local image with a vision-capable model (gpt-4o by default)."""
    
    # Determine media type
    suffix = Path(image_path).suffix.lower()
    media_types = {
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".png": "image/png",
        ".gif": "image/gif",
        ".webp": "image/webp"
    }
    media_type = media_types.get(suffix, "image/jpeg")
    
    # Encode image
    base64_image = encode_image(image_path)
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{media_type};base64,{base64_image}",
                            "detail": "high"  # "low", "high", or "auto"
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    
    return response.choices[0].message.content

def analyze_image_url(
    image_url: str,
    prompt: str = "Describe this image."
) -> str:
    """Analyze an image from URL."""
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": image_url}
                    }
                ]
            }
        ]
    )
    
    return response.choices[0].message.content

# Usage
description = analyze_image(
    "product_photo.jpg",
    "Describe this product image for an e-commerce listing. Include color, material, and key features."
)
print(description)
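
The `analyze_image` helper silently falls back to `image/jpeg` for unknown extensions and sends files of any size. A minimal sketch of stricter input validation follows, assuming you would rather fail before making an API call. The names `build_image_part` and `MAX_IMAGE_BYTES` are hypothetical, and the 20 MB ceiling is an assumption to check against your provider's documented limits.

```python
import base64
from pathlib import Path

# Extensions mirrored from the media_types mapping in analyze_image.
SUPPORTED_MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
}

# Assumed upper bound; verify against the provider's documentation.
MAX_IMAGE_BYTES = 20 * 1024 * 1024


def build_image_part(image_path: str, detail: str = "auto") -> dict:
    """Return an image_url content part, or raise ValueError on bad input."""
    path = Path(image_path)
    media_type = SUPPORTED_MEDIA_TYPES.get(path.suffix.lower())
    if media_type is None:
        raise ValueError(f"Unsupported image type: {path.suffix or '(none)'}")
    if detail not in {"low", "high", "auto"}:
        raise ValueError(f"Invalid detail level: {detail}")

    data = path.read_bytes()
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError(f"Image is {len(data)} bytes; limit is {MAX_IMAGE_BYTES}")

    encoded = base64.b64encode(data).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{media_type};base64,{encoded}", "detail": detail},
    }
```

The returned dictionary can be placed directly in the `content` list next to the text part.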

Multiple Image Comparison

def compare_images(
    image_paths: list[str],
    prompt: str = "Compare these images and describe the differences."
) -> str:
    """Compare multiple images."""
    
    content = [{"type": "text", "text": prompt}]
    
    for path in image_paths:
        base64_image = encode_image(path)
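        suffix = Path(path).suffix.lower()
        # Map the extension to its media type so GIF and WebP are labeled correctly
        media_type = {
            ".jpg": "image/jpeg",
            ".jpeg": "image/jpeg",
            ".png": "image/png",
            ".gif": "image/gif",
            ".webp": "image/webp",
        }.get(suffix, "image/png")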
        
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:{media_type};base64,{base64_image}"
            }
        })
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=1500
    )
    
    return response.choices[0].message.content

# Usage - Compare before/after images
comparison = compare_images(
    ["before.jpg", "after.jpg"],
    "Compare these before and after images. What changes were made?"
)

# Usage - Product comparison
comparison = compare_images(
    ["product_a.jpg", "product_b.jpg", "product_c.jpg"],
    "Compare these three products. Create a comparison table with features, pros, and cons."
)
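
When several images are sent in one message, the model has no names to refer to them by. One option, sketched here as an assumption rather than a documented requirement, is to put a short text label before each image part. `build_labeled_content` is a hypothetical helper; it only builds the `content` list and makes no API call.

```python
def build_labeled_content(prompt: str, images: list[tuple[str, str]]) -> list[dict]:
    """Interleave a text label before each image part.

    `images` is a list of (label, url) pairs, where url is an https URL
    or a base64 data URL such as the ones built in compare_images.
    """
    if not images:
        raise ValueError("At least one image is required")

    content = [{"type": "text", "text": prompt}]
    for label, url in images:
        # The label gives the model a name to use in its answer.
        content.append({"type": "text", "text": f"{label}:"})
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content


content = build_labeled_content(
    "Compare these images and describe the differences.",
    [
        ("Image A (before)", "https://example.com/before.jpg"),
        ("Image B (after)", "https://example.com/after.jpg"),
    ],
)
```

The resulting list can be passed as the `content` of a single user message.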

Document Understanding

from pydantic import BaseModel
from typing import Optional
import json

class ExtractedDocument(BaseModel):
    document_type: str
    # Default to None so a response that omits these keys still validates
    title: Optional[str] = None
    date: Optional[str] = None
    key_fields: dict
    tables: list[dict]
    summary: str

def extract_document_data(
    image_path: str,
    document_type: str = "auto"
) -> ExtractedDocument:
    """Extract structured data from document image."""
    
    prompt = f"""Analyze this document image and extract all relevant information.

Document type hint: {document_type}

Extract:
1. Document type (invoice, receipt, form, contract, etc.)
2. Title or header
3. Date if present
4. All key fields and their values
5. Any tables with their data
6. Brief summary

Return as JSON with schema:
{{
    "document_type": "string",
    "title": "string or null",
    "date": "string or null",
    "key_fields": {{"field_name": "value"}},
    "tables": [{{"headers": [], "rows": [[]]}}],
    "summary": "string"
}}"""
    
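    # Derive the media type from the file extension (JPEG or PNG)
    suffix = Path(image_path).suffix.lower()
    media_type = "image/jpeg" if suffix in (".jpg", ".jpeg") else "image/png"
    base64_image = encode_image(image_path)
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{media_type};base64,{base64_image}",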
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        response_format={"type": "json_object"}
    )
    
    data = json.loads(response.choices[0].message.content)
    return ExtractedDocument(**data)

def process_invoice(image_path: str) -> dict:
    """Extract invoice-specific data."""
    
    prompt = """Extract invoice data from this image.

Return JSON with:
{
    "invoice_number": "string",
    "invoice_date": "YYYY-MM-DD",
    "due_date": "YYYY-MM-DD or null",
    "vendor": {"name": "", "address": ""},
    "customer": {"name": "", "address": ""},
    "line_items": [{"description": "", "quantity": 0, "unit_price": 0, "total": 0}],
    "subtotal": 0,
    "tax": 0,
    "total": 0,
    "currency": "USD"
}"""
    
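    # Derive the media type from the file extension (JPEG or PNG)
    suffix = Path(image_path).suffix.lower()
    media_type = "image/jpeg" if suffix in (".jpg", ".jpeg") else "image/png"
    base64_image = encode_image(image_path)
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{media_type};base64,{base64_image}",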
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        response_format={"type": "json_object"}
    )
    
    return json.loads(response.choices[0].message.content)

# Usage
invoice_data = process_invoice("invoice_scan.png")
print(f"Invoice #{invoice_data['invoice_number']}")
print(f"Total: {invoice_data['currency']} {invoice_data['total']}")
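
Vision models can misread digits, so extracted invoice figures are worth checking arithmetically before they reach downstream systems. The sketch below assumes the JSON shape requested in `process_invoice`; `check_invoice_totals` and its tolerance parameter are hypothetical names introduced for illustration.

```python
from decimal import Decimal, InvalidOperation


def _to_decimal(value) -> Decimal:
    """Convert a JSON number or numeric string to Decimal via str to avoid float noise."""
    try:
        return Decimal(str(value))
    except InvalidOperation as exc:
        raise ValueError(f"Not a number: {value!r}") from exc


def check_invoice_totals(invoice: dict, tolerance: str = "0.01") -> list[str]:
    """Return a list of arithmetic problems; an empty list means the figures agree."""
    tol = Decimal(tolerance)
    problems = []

    # Each line total should equal quantity * unit_price.
    line_sum = Decimal("0")
    for i, item in enumerate(invoice.get("line_items", []), 1):
        expected = _to_decimal(item["quantity"]) * _to_decimal(item["unit_price"])
        actual = _to_decimal(item["total"])
        if abs(expected - actual) > tol:
            problems.append(f"Line {i}: expected {expected}, extracted {actual}")
        line_sum += actual

    # Line totals should add up to the subtotal.
    subtotal = _to_decimal(invoice["subtotal"])
    if abs(line_sum - subtotal) > tol:
        problems.append(f"Subtotal: lines sum to {line_sum}, extracted {subtotal}")

    # Subtotal plus tax should equal the grand total.
    tax = _to_decimal(invoice["tax"])
    total = _to_decimal(invoice["total"])
    if abs(subtotal + tax - total) > tol:
        problems.append(f"Total: subtotal + tax is {subtotal + tax}, extracted {total}")

    return problems
```

A non-empty result is a reasonable trigger for a retry or for routing the document to manual review.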

Chart and Graph Analysis

def analyze_chart(
    image_path: str,
    questions: Optional[list[str]] = None
) -> dict:
    """Analyze a chart or graph image."""
    
    base_prompt = """Analyze this chart/graph image.

Extract:
1. Chart type (bar, line, pie, scatter, etc.)
2. Title and axis labels
3. Data series and their values (estimate if needed)
4. Key trends and insights
5. Any notable outliers or patterns"""
    
    if questions:
        base_prompt += "\n\nAlso answer these specific questions:\n"
        for i, q in enumerate(questions, 1):
            base_prompt += f"{i}. {q}\n"
    
    base_prompt += """

Return JSON:
{
    "chart_type": "string",
    "title": "string",
    "x_axis": "string",
    "y_axis": "string",
    "data_series": [{"name": "", "values": []}],
    "insights": ["string"],
    "answers": ["string"] // if questions provided
}"""
    
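    # Derive the media type from the file extension (JPEG or PNG)
    suffix = Path(image_path).suffix.lower()
    media_type = "image/jpeg" if suffix in (".jpg", ".jpeg") else "image/png"
    base64_image = encode_image(image_path)
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": base_prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{media_type};base64,{base64_image}",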
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        response_format={"type": "json_object"}
    )
    
    return json.loads(response.choices[0].message.content)

# Usage
chart_analysis = analyze_chart(
    "sales_chart.png",
    questions=[
        "What month had the highest sales?",
        "What is the overall trend?",
        "Are there any seasonal patterns?"
    ]
)

print(f"Chart type: {chart_analysis['chart_type']}")
for insight in chart_analysis['insights']:
    print(f"- {insight}")
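
JSON mode guarantees valid JSON but not that every requested key is present, so indexing `chart_analysis['answers']` directly can fail. A small sketch of a defensive accessor follows; `pair_answers` is a hypothetical helper that matches each question to its answer by position and fills gaps with a placeholder.

```python
def pair_answers(questions: list[str], analysis: dict) -> list[tuple[str, str]]:
    """Pair each question with the answer at the same index, if the model returned one."""
    answers = analysis.get("answers") or []
    if not isinstance(answers, list):
        # Tolerate a single string where a list was requested.
        answers = [answers]

    paired = []
    for i, question in enumerate(questions):
        answer = str(answers[i]) if i < len(answers) else "(no answer returned)"
        paired.append((question, answer))
    return paired


pairs = pair_answers(
    ["What month had the highest sales?", "What is the overall trend?"],
    {"chart_type": "bar", "answers": ["December"]},
)
for question, answer in pairs:
    print(f"Q: {question}\nA: {answer}")
```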

Audio Transcription and Analysis

def transcribe_audio(
    audio_path: str,
    language: str = None,
    prompt: str = None
) -> dict:
    """Transcribe audio using Whisper."""
    
    with open(audio_path, "rb") as audio_file:
        kwargs = {"model": "whisper-1", "file": audio_file}
        
        if language:
            kwargs["language"] = language
        if prompt:
            kwargs["prompt"] = prompt  # Helps with domain-specific terms
        
        response = client.audio.transcriptions.create(**kwargs)
    
    return {"text": response.text}

def transcribe_with_timestamps(audio_path: str) -> dict:
    """Transcribe with word-level timestamps."""
    
    with open(audio_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word", "segment"]
        )
    
    return {
        "text": response.text,
        "segments": response.segments,
        "words": response.words
    }

def analyze_audio_content(audio_path: str, analysis_type: str = "summary") -> str:
    """Transcribe and analyze audio content."""
    
    # First transcribe
    transcription = transcribe_audio(audio_path)
    text = transcription["text"]
    
    # Then analyze with LLM
    prompts = {
        "summary": f"Summarize this transcript in 3-5 bullet points:\n\n{text}",
        "action_items": f"Extract action items and next steps from this meeting transcript:\n\n{text}",
        "sentiment": f"Analyze the sentiment and tone of this conversation:\n\n{text}",
        "key_topics": f"Identify the main topics discussed in this transcript:\n\n{text}"
    }
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompts.get(analysis_type, prompts["summary"])}]
    )
    
    return response.choices[0].message.content

# Usage
transcript = transcribe_audio("meeting.mp3")
print(f"Transcript: {transcript['text'][:500]}...")

action_items = analyze_audio_content("meeting.mp3", "action_items")
print(f"Action items:\n{action_items}")
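
Segment timestamps from `transcribe_with_timestamps` are easier to use in meeting notes once formatted as readable lines. The sketch below assumes each segment is a plain dictionary with `start`, `end` (both in seconds), and `text` keys; depending on the SDK version the segments may be objects instead, in which case they would need converting first. Both function names are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS, dropping fractional seconds."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_segments(segments: list[dict]) -> str:
    """Render segments as one '[start - end] text' line each."""
    lines = []
    for segment in segments:
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} - {end}] {segment['text'].strip()}")
    return "\n".join(lines)


print(format_segments([
    {"start": 0.0, "end": 4.2, "text": " Welcome everyone."},
    {"start": 61, "end": 75.5, "text": "First agenda item."},
]))
```

A timestamped transcript like this can also be passed to the analysis prompts so that action items cite where in the recording they came from.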

Multi-Modal RAG

from dataclasses import dataclass
from typing import Union
from enum import Enum

class ContentType(str, Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"

@dataclass
class MultiModalDocument:
    id: str
    content_type: ContentType
    content: Union[str, bytes]
    metadata: dict
    embedding: list[float] = None

class MultiModalRAG:
    """RAG system supporting text, images, and audio."""
    
    def __init__(self):
        self.documents: list[MultiModalDocument] = []
    
    def _get_text_embedding(self, text: str) -> list[float]:
        """Get embedding for text."""
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding
    
    def _describe_image(self, image_path: str) -> str:
        """Get text description of image for embedding."""
        return analyze_image(
            image_path,
            "Describe this image in detail for search indexing. Include all visible text, objects, colors, and context."
        )
    
    def _transcribe_audio(self, audio_path: str) -> str:
        """Get text from audio for embedding."""
        result = transcribe_audio(audio_path)
        return result["text"]
    
    def add_document(
        self,
        doc_id: str,
        content_type: ContentType,
        content_path: str,
        metadata: dict = None
    ):
        """Add a document of any type."""
        
        # Convert to text for embedding
        if content_type == ContentType.TEXT:
            with open(content_path) as f:
                text = f.read()
        elif content_type == ContentType.IMAGE:
            text = self._describe_image(content_path)
        elif content_type == ContentType.AUDIO:
            text = self._transcribe_audio(content_path)
        else:
            raise ValueError(f"Unsupported content type: {content_type}")
        
        # Get embedding
        embedding = self._get_text_embedding(text[:8000])
        
        doc = MultiModalDocument(
            id=doc_id,
            content_type=content_type,
            content=text,
            metadata=metadata or {},
            embedding=embedding
        )
        
        self.documents.append(doc)
    
    def search(self, query: str, k: int = 5) -> list[MultiModalDocument]:
        """Search across all document types."""
        
        query_embedding = self._get_text_embedding(query)
        
        # Calculate similarities
        import numpy as np
        
        scored = []
        for doc in self.documents:
            similarity = np.dot(query_embedding, doc.embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc.embedding)
            )
            scored.append((doc, similarity))
        
        # Sort by similarity
        scored.sort(key=lambda x: x[1], reverse=True)
        
        return [doc for doc, _ in scored[:k]]
    
    def query(self, question: str, k: int = 3) -> str:
        """Query with multi-modal context."""
        
        # Retrieve relevant documents
        docs = self.search(question, k=k)
        
        # Build context
        context_parts = []
        for doc in docs:
            prefix = f"[{doc.content_type.value.upper()}]"
            context_parts.append(f"{prefix}: {doc.content[:2000]}")
        
        context = "\n\n".join(context_parts)
        
        # Generate answer
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": f"Answer based on this context:\n\n{context}"
                },
                {"role": "user", "content": question}
            ]
        )
        
        return response.choices[0].message.content

# Usage
rag = MultiModalRAG()

# Add different content types
rag.add_document("doc1", ContentType.TEXT, "report.txt")
rag.add_document("img1", ContentType.IMAGE, "diagram.png")
rag.add_document("audio1", ContentType.AUDIO, "meeting.mp3")

# Query across all types
answer = rag.query("What were the main points discussed about the architecture?")
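
`add_document` embeds only the first 8,000 characters of each document, so anything after that point is invisible to search. A common alternative is to split long text into overlapping chunks and embed each one separately. The sketch below shows only the splitting step; `chunk_text` and its default sizes are assumptions, and character counts are a rough stand-in for token counts.

```python
def chunk_text(text: str, chunk_size: int = 8000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of them.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    chunks = []
    step = chunk_size - overlap
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last chunk reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

Each chunk would then be stored as its own `MultiModalDocument`, with the parent document's id kept in `metadata`.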

Production Multi-Modal Service

from fastapi import FastAPI, UploadFile, File, Form
from pydantic import BaseModel
from typing import Optional
import tempfile
import os

app = FastAPI()

class AnalysisResponse(BaseModel):
    content_type: str
    analysis: dict
    text_content: Optional[str]

@app.post("/analyze/image", response_model=AnalysisResponse)
async def analyze_image_endpoint(
    file: UploadFile = File(...),
    prompt: str = Form(default="Describe this image in detail.")
):
    """Analyze an uploaded image."""
    
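    # Save temporarily, keeping the upload's extension so analyze_image picks the right media type
    suffix = os.path.splitext(file.filename or "")[1].lower() or ".png"
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp: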
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name
    
    try:
        description = analyze_image(tmp_path, prompt)
        
        return AnalysisResponse(
            content_type="image",
            analysis={"description": description},
            text_content=description
        )
    finally:
        os.unlink(tmp_path)

@app.post("/analyze/document", response_model=AnalysisResponse)
async def analyze_document_endpoint(
    file: UploadFile = File(...),
    document_type: str = Form(default="auto")
):
    """Extract data from document image."""
    
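    # Keep the upload's extension; fall back to .png when the filename has none
    suffix = os.path.splitext(file.filename or "")[1].lower() or ".png"
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp: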
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name
    
    try:
        extracted = extract_document_data(tmp_path, document_type)
        
        return AnalysisResponse(
            content_type="document",
            analysis=extracted.model_dump(),
            text_content=extracted.summary
        )
    finally:
        os.unlink(tmp_path)

@app.post("/analyze/audio", response_model=AnalysisResponse)
async def analyze_audio_endpoint(
    file: UploadFile = File(...),
    analysis_type: str = Form(default="summary")
):
    """Transcribe and analyze audio."""
    
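    # Keep the upload's extension; fall back to .mp3 when the filename has none
    suffix = os.path.splitext(file.filename or "")[1].lower() or ".mp3"
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp: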
        content = await file.read()
        tmp.write(content)
        tmp_path = tmp.name
    
    try:
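        # Note: analyze_audio_content transcribes the file again internally,
        # so this endpoint makes two transcription calls per request.
        transcript = transcribe_audio(tmp_path)
        analysis = analyze_audio_content(tmp_path, analysis_type)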
        
        return AnalysisResponse(
            content_type="audio",
            analysis={
                "transcript": transcript["text"],
                "analysis": analysis
            },
            text_content=transcript["text"]
        )
    finally:
        os.unlink(tmp_path)
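
All three endpoints repeat the same save, process, and delete sequence. One way to factor it out is a context manager, sketched below under the assumption that the upload has already been read into memory. `temp_upload` is a hypothetical helper name; it keeps the original file extension and always removes the file on exit.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def temp_upload(data: bytes, filename: str, default_suffix: str = ".bin") -> Iterator[str]:
    """Write uploaded bytes to a temporary file and delete it afterwards.

    The suffix is taken from the client's filename so that downstream
    helpers relying on the extension see the right type.
    """
    suffix = Path(filename or "").suffix.lower() or default_suffix
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        tmp.write(data)
        tmp.close()  # Close before handing the path to other code
        yield tmp.name
    finally:
        tmp.close()
        if os.path.exists(tmp.name):
            os.unlink(tmp.name)


with temp_upload(b"example bytes", "photo.JPG") as path:
    print(path.endswith(".jpg"), os.path.getsize(path))
```

Inside an endpoint this would replace the `NamedTemporaryFile` block and the `try`/`finally` around the analysis call.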

References

GPT-4 Vision: https://platform.openai.com/docs/guides/vision
Whisper API: https://platform.openai.com/docs/guides/speech-to-text
Claude Vision: https://docs.anthropic.com/claude/docs/vision
Gemini Multi-Modal: https://ai.google.dev/docs/multimodal_concepts

Conclusion

Multi-modal AI opens new possibilities for applications that understand the world beyond text. Use vision models for document processing, product analysis, and chart understanding. Combine audio transcription with LLM analysis for meeting summaries and content extraction. Build multi-modal RAG systems that search across text, images, and audio. The key is converting all modalities to a common representation (text or embeddings) for unified processing. As multi-modal models improve, expect even tighter integration between modalities and new capabilities like video understanding and real-time audio conversation.

Building LLM-Powered CLI Tools: From Terminal to AI Assistant

Introduction: Command-line tools are the developer’s natural habitat. Adding LLM capabilities to CLI tools creates powerful utilities for code generation, documentation, data transformation, and automation. Unlike web apps, CLI tools are fast to build, easy to integrate into existing workflows, and perfect for power users who live in the terminal. This guide covers building production-quality LLM-powered CLI tools using Python: argument parsing with Click and Typer, streaming output with Rich, configuration management, and patterns for common use cases like code explanation, commit message generation, and interactive chat.

Figure: LLM CLI, from terminal input to formatted output

Basic CLI with Typer

# pip install typer rich openai

import typer
from rich.console import Console
from rich.markdown import Markdown
from openai import OpenAI

app = typer.Typer(help="AI-powered CLI tools")
console = Console()
client = OpenAI()

@app.command()
def ask(
    question: str = typer.Argument(..., help="Question to ask the AI"),
    model: str = typer.Option("gpt-4o-mini", "--model", "-m", help="Model to use"),
    system: str = typer.Option("You are a helpful assistant.", "--system", "-s"),
):
    """Ask a question and get an AI response."""
    
    with console.status("[bold green]Thinking..."):
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": question}
            ]
        )
    
    answer = response.choices[0].message.content
    console.print(Markdown(answer))

@app.command()
def chat(
    model: str = typer.Option("gpt-4o", "--model", "-m"),
):
    """Start an interactive chat session."""
    
    console.print("[bold blue]Starting chat. Type 'exit' to quit.[/bold blue]\n")
    
    messages = []
    
    while True:
        try:
            user_input = console.input("[bold green]You:[/bold green] ")
        except (KeyboardInterrupt, EOFError):
            break
        
        if user_input.lower() in ("exit", "quit", "q"):
            break
        
        messages.append({"role": "user", "content": user_input})
        
        console.print("[bold blue]AI:[/bold blue] ", end="")
        
        stream = client.chat.completions.create(
            model=model,
            messages=messages,
            stream=True
        )
        
        full_response = ""
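        for chunk in stream:
            # Some chunks carry no choices or an empty delta
            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                # markup=False prints model output verbatim instead of parsing [brackets] as Rich markup
                console.print(content, end="", markup=False, highlight=False)
                full_response += content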
        
        console.print("\n")
        messages.append({"role": "assistant", "content": full_response})

if __name__ == "__main__":
    app()
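
The `chat` command appends every turn to `messages`, so a long session will eventually exceed the model's context window. A minimal sketch of one mitigation follows: keep any system messages and drop the oldest turns once a size budget is exceeded. `trim_history` is a hypothetical helper, and the character budget is an assumed rough proxy for tokens.

```python
def trim_history(messages: list[dict], max_chars: int = 12000) -> list[dict]:
    """Keep system messages plus the most recent turns that fit within max_chars."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    budget = max_chars - sum(len(m["content"]) for m in system)
    kept = []
    # Walk backwards so the newest turns are kept first.
    for message in reversed(turns):
        cost = len(message["content"])
        # Always keep the latest message, even if it alone exceeds the budget.
        if kept and cost > budget:
            break
        kept.append(message)
        budget -= cost

    kept.reverse()
    return system + kept


history = [
    {"role": "system", "content": "s"},
    {"role": "user", "content": "aaaa"},
    {"role": "assistant", "content": "bbbb"},
    {"role": "user", "content": "cc"},
]
print(trim_history(history, max_chars=7))
```

In the chat loop, the trimmed list would be passed as `messages=trim_history(messages)` while the full list is kept for display or logging.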

Code Explanation Tool

import typer
from pathlib import Path
from rich.console import Console
from rich.syntax import Syntax
from rich.panel import Panel
from openai import OpenAI

app = typer.Typer()
console = Console()
client = OpenAI()

@app.command()
def explain(
    file_path: Path = typer.Argument(..., help="Path to code file"),
    detailed: bool = typer.Option(False, "--detailed", "-d", help="Detailed explanation"),
    language: str = typer.Option(None, "--lang", "-l", help="Override language detection"),
):
    """Explain what a code file does."""
    
    if not file_path.exists():
        console.print(f"[red]Error: File not found: {file_path}[/red]")
        raise typer.Exit(1)
    
    code = file_path.read_text()
    
    # Detect language from extension
    lang = language or file_path.suffix.lstrip(".")
    lang_map = {"py": "python", "js": "javascript", "ts": "typescript", "rb": "ruby"}
    lang = lang_map.get(lang, lang)
    
    # Show the code
    console.print(Panel(
        Syntax(code, lang, theme="monokai", line_numbers=True),
        title=f"[bold]{file_path.name}[/bold]"
    ))
    
    # Build prompt
    detail_level = "detailed, line-by-line" if detailed else "concise"
    
    prompt = f"""Explain this {lang} code in a {detail_level} manner:

```{lang}
{code}
```

Focus on:
1. What the code does overall
2. Key functions/classes and their purposes
3. Any notable patterns or techniques used
4. Potential issues or improvements"""
    
    console.print("\n[bold blue]Explanation:[/bold blue]\n")
    
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert code reviewer. Explain code clearly and concisely."},
            {"role": "user", "content": prompt}
        ],
        stream=True
    )
    
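    for chunk in stream:
        # Some chunks carry no choices or an empty delta
        if chunk.choices and chunk.choices[0].delta.content:
            # markup=False prints model output verbatim instead of parsing [brackets] as Rich markup
            console.print(chunk.choices[0].delta.content, end="", markup=False, highlight=False)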
    
    console.print("\n")

@app.command()
def review(
    file_path: Path = typer.Argument(..., help="Path to code file"),
    focus: str = typer.Option("all", "--focus", "-f", help="Focus area: security, performance, style, all"),
):
    """Review code for issues and improvements."""
    
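    if not file_path.exists():
        console.print(f"[red]Error: File not found: {file_path}[/red]")
        raise typer.Exit(1)
    
    code = file_path.read_text()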
    lang = file_path.suffix.lstrip(".")
    
    focus_prompts = {
        "security": "Focus on security vulnerabilities, injection risks, and unsafe patterns.",
        "performance": "Focus on performance issues, inefficiencies, and optimization opportunities.",
        "style": "Focus on code style, readability, and best practices.",
        "all": "Review for security, performance, style, and general code quality."
    }
    
    prompt = f"""Review this code:

```{lang}
{code}
```

{focus_prompts.get(focus, focus_prompts["all"])}

Format your response as:
## Issues Found
- List each issue with severity (High/Medium/Low)

## Recommendations
- Specific improvements with code examples"""
    
    with console.status("[bold green]Reviewing code..."):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a senior code reviewer. Be thorough but constructive."},
                {"role": "user", "content": prompt}
            ]
        )
    
    from rich.markdown import Markdown
    console.print(Markdown(response.choices[0].message.content))
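
Both prompts wrap the file in a triple-backtick fence, which breaks when the file itself contains triple backticks (a Markdown file, or code with embedded prompts). A small sketch of a workaround follows: choose a fence longer than any backtick run in the content. `fence_code` is a hypothetical helper name.

```python
import re


def fence_code(code: str, lang: str = "") -> str:
    """Wrap code in a backtick fence longer than any backtick run it contains."""
    longest_run = max((len(run) for run in re.findall(r"`+", code)), default=0)
    fence = "`" * max(3, longest_run + 1)
    return f"{fence}{lang}\n{code}\n{fence}"


print(fence_code("print('hello')", "python"))
```

In the prompt templates, the fenced section would then be replaced by `{fence_code(code, lang)}`.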

Git Commit Message Generator

import subprocess
import typer
from rich.console import Console
from openai import OpenAI

app = typer.Typer()
console = Console()
client = OpenAI()

def get_git_diff(staged_only: bool = True) -> str:
    """Get git diff."""
    cmd = ["git", "diff", "--cached"] if staged_only else ["git", "diff"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout

def get_recent_commits(n: int = 5) -> str:
    """Get recent commit messages for style reference."""
    result = subprocess.run(
        ["git", "log", f"-{n}", "--oneline"],
        capture_output=True, text=True
    )
    return result.stdout

@app.command()
def commit(
    staged: bool = typer.Option(True, "--staged/--all", help="Use staged changes only"),
    conventional: bool = typer.Option(True, "--conventional/--simple", help="Use conventional commits"),
    auto: bool = typer.Option(False, "--auto", "-a", help="Auto-commit without confirmation"),
):
    """Generate a commit message from git diff."""
    
    diff = get_git_diff(staged)
    
    if not diff:
        console.print("[yellow]No changes to commit.[/yellow]")
        raise typer.Exit(0)
    
    # Truncate very long diffs
    if len(diff) > 10000:
        diff = diff[:10000] + "\n... (truncated)"
    
    recent = get_recent_commits(5)
    
    style_guide = """Use conventional commit format:
<type>(<scope>): <description>

[optional body]

Types: feat, fix, docs, style, refactor, test, chore
Keep the first line under 72 characters.""" if conventional else "Write a clear, concise commit message."
    
    prompt = f"""Generate a commit message for these changes:

```diff
{diff}
```

Recent commits for style reference:
{recent}

{style_guide}

Return ONLY the commit message, no explanation."""
    
    with console.status("[bold green]Generating commit message..."):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a git commit message generator. Write clear, professional commit messages."},
                {"role": "user", "content": prompt}
            ]
        )
    
    message = response.choices[0].message.content.strip()
    
    console.print("\n[bold blue]Generated commit message:[/bold blue]")
    console.print(f"\n{message}\n")
    
    if auto:
        do_commit = True
    else:
        do_commit = typer.confirm("Use this commit message?")
    
    if do_commit:
        result = subprocess.run(
            ["git", "commit", "-m", message],
            capture_output=True, text=True
        )
        
        if result.returncode == 0:
            console.print("[green]Committed successfully![/green]")
        else:
            console.print(f"[red]Commit failed: {result.stderr}[/red]")
    else:
        console.print("[yellow]Commit cancelled.[/yellow]")

@app.command()
def pr(
    base: str = typer.Option("main", "--base", "-b", help="Base branch"),
):
    """Generate a PR description from commits."""
    
    # Get commits not in base
    result = subprocess.run(
        ["git", "log", f"{base}..HEAD", "--oneline"],
        capture_output=True, text=True
    )
    commits = result.stdout
    
    # Get full diff
    result = subprocess.run(
        ["git", "diff", f"{base}...HEAD"],
        capture_output=True, text=True
    )
    diff = result.stdout[:15000]  # Truncate
    
    prompt = f"""Generate a PR description for these changes:

Commits:
{commits}

Diff summary:
```diff
{diff}
```

Format:
## Summary
Brief overview of changes

## Changes
- List of specific changes

## Testing
How to test these changes"""
    
    with console.status("[bold green]Generating PR description..."):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}]
        )
    
    from rich.markdown import Markdown
    console.print(Markdown(response.choices[0].message.content))
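
Even with "Return ONLY the commit message" in the prompt, a model may wrap its answer in a code fence or ignore the requested format. A sketch of a post-processing step follows, assuming the type list from the style guide above. `clean_message` and `header_problems` are hypothetical helpers, and the regular expression is a simplified check, not a full Conventional Commits parser.

```python
import re

COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")
# type, optional (scope), optional "!", then ": description"
HEADER_RE = re.compile(r"^(" + "|".join(COMMIT_TYPES) + r")(\([^()\s]+\))?!?: \S.*$")
FENCE = "`" * 3


def clean_message(raw: str) -> str:
    """Strip surrounding whitespace and a wrapping code fence, if present."""
    lines = raw.strip().splitlines()
    if len(lines) >= 2 and lines[0].startswith(FENCE) and lines[-1].strip() == FENCE:
        lines = lines[1:-1]
    return "\n".join(lines).strip()


def header_problems(message: str, max_len: int = 72) -> list[str]:
    """Return format problems found in the first line of a commit message."""
    header = message.splitlines()[0] if message else ""
    problems = []
    if not HEADER_RE.match(header):
        problems.append("Header does not match 'type(scope): description'")
    if len(header) > max_len:
        problems.append(f"Header is {len(header)} characters; limit is {max_len}")
    return problems


message = clean_message(FENCE + "\nfeat(cli): add chat command\n" + FENCE)
print(message, header_problems(message))
```

If `header_problems` returns anything, the tool could regenerate the message or show the problems before asking for confirmation.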

Configuration Management

import os
from pathlib import Path
from dataclasses import dataclass
import json
import typer
from rich.console import Console

@dataclass
class Config:
    api_key: str = ""
    default_model: str = "gpt-4o-mini"
    temperature: float = 0.7
    max_tokens: int = 2000
    
    @classmethod
    def load(cls) -> "Config":
        """Load config from file and environment."""
        config_path = Path.home() / ".config" / "llm-cli" / "config.json"
        
        config = cls()
        
        # Load from file
        if config_path.exists():
            data = json.loads(config_path.read_text())
            config.default_model = data.get("default_model", config.default_model)
            config.temperature = data.get("temperature", config.temperature)
            config.max_tokens = data.get("max_tokens", config.max_tokens)
        
        # Environment overrides
        config.api_key = os.environ.get("OPENAI_API_KEY", config.api_key)
        
        if env_model := os.environ.get("LLM_MODEL"):
            config.default_model = env_model
        
        return config
    
    def save(self):
        """Save config to file."""
        config_path = Path.home() / ".config" / "llm-cli" / "config.json"
        config_path.parent.mkdir(parents=True, exist_ok=True)
        
        data = {
            "default_model": self.default_model,
            "temperature": self.temperature,
            "max_tokens": self.max_tokens
        }
        
        config_path.write_text(json.dumps(data, indent=2))

# CLI with config
app = typer.Typer()
console = Console()

@app.command()
def config_show():
    """Show current configuration."""
    cfg = Config.load()
    
    console.print("[bold]Current Configuration:[/bold]")
    console.print(f"  Model: {cfg.default_model}")
    console.print(f"  Temperature: {cfg.temperature}")
    console.print(f"  Max Tokens: {cfg.max_tokens}")
    console.print(f"  API Key: {'[set]' if cfg.api_key else '[not set]'}")

@app.command()
def config_set(
    model: str = typer.Option(None, "--model", "-m"),
    temperature: float = typer.Option(None, "--temperature", "-t"),
    max_tokens: int = typer.Option(None, "--max-tokens"),
):
    """Update configuration."""
    cfg = Config.load()
    
    if model:
        cfg.default_model = model
    if temperature is not None:
        cfg.temperature = temperature
    if max_tokens:
        cfg.max_tokens = max_tokens
    
    cfg.save()
    console.print("[green]Configuration saved.[/green]")

# Use config in commands
@app.command()
def ask(question: str):
    """Ask with configured defaults."""
    cfg = Config.load()
    
    from openai import OpenAI
    client = OpenAI(api_key=cfg.api_key) if cfg.api_key else OpenAI()
    
    response = client.chat.completions.create(
        model=cfg.default_model,
        messages=[{"role": "user", "content": question}],
        temperature=cfg.temperature,
        max_tokens=cfg.max_tokens
    )
    
    console.print(response.choices[0].message.content)
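
The load logic above layers its sources: built-in defaults, then the config file, then environment variables. A minimal sketch of that precedence as a pure function may make the order easier to test. The `Settings` class, its default values, and the `cli_model` parameter are hypothetical stand-ins for illustration and are not part of the original `Config`.

```python
import os
from dataclasses import dataclass, asdict
from typing import Mapping, Optional


@dataclass
class Settings:
    """Hypothetical stand-in for the Config fields; defaults are assumed."""
    default_model: str = "gpt-4o-mini"
    temperature: float = 0.7
    max_tokens: int = 1000


def resolve_settings(
    file_data: Mapping[str, object],
    env: Mapping[str, str],
    cli_model: Optional[str] = None,
) -> Settings:
    """Layer sources from lowest to highest priority: defaults, file, env, CLI."""
    settings = Settings()
    # File values replace defaults only for keys that are present.
    for key in asdict(settings):
        if key in file_data:
            setattr(settings, key, file_data[key])
    # The environment overrides the file, mirroring the LLM_MODEL check.
    if env_model := env.get("LLM_MODEL"):
        settings.default_model = env_model
    # Assumed extra layer: an explicit command-line value wins over the rest.
    if cli_model is not None:
        settings.default_model = cli_model
    return settings


if __name__ == "__main__":
    print(resolve_settings({"temperature": 0.2}, os.environ))
```

Passing the file data and environment in as plain mappings keeps the function free of disk and process state, so each layer can be checked in isolation.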

Pipe-Friendly CLI

import sys
import typer
from rich.console import Console

app = typer.Typer()
console = Console(stderr=True)  # Diagnostics go to stderr so piped stdout stays clean

def read_stdin() -> str:
    """Read from stdin if available."""
    if not sys.stdin.isatty():
        return sys.stdin.read()
    return ""

@app.command()
def transform(
    instruction: str = typer.Argument(..., help="Transformation instruction"),
    input_text: str = typer.Option(None, "--input", "-i", help="Input text (or use stdin)"),
):
    """Transform text using AI. Supports piping."""
    
    # Get input from the --input option, falling back to stdin
    text = input_text or read_stdin()
    
    if not text:
        console.print("[red]No input provided. Use --input or pipe text.[/red]")
        raise typer.Exit(1)
    
    from openai import OpenAI
    client = OpenAI()
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Transform the input text according to: {instruction}. Output only the transformed text."},
            {"role": "user", "content": text}
        ]
    )
    
    # Output to stdout (no formatting for pipe compatibility)
    print(response.choices[0].message.content)

@app.command()
def summarize(
    length: str = typer.Option("medium", "--length", "-l", help="short, medium, or long"),
):
    """Summarize text from stdin."""
    
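    # Target lengths are approximate word counts, not tokens
    length_words = {"short": 50, "medium": 150, "long": 300}
    
    # Reject unknown lengths up front instead of failing with a KeyError
    if length not in length_words:
        console.print(f"[red]Unknown length '{length}'. Use short, medium, or long.[/red]")
        raise typer.Exit(1)
    
    text = read_stdin()
    
    if not text:
        console.print("[red]Pipe text to summarize.[/red]")
        raise typer.Exit(1)
    
    from openai import OpenAI
    client = OpenAI()
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Summarize the text in about {length_words[length]} words."},
            {"role": "user", "content": text}
        ]
    )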
    
    print(response.choices[0].message.content)

# Usage examples:
# cat file.txt | llm summarize
# echo "Hello world" | llm transform "translate to French"
# curl https://example.com | llm summarize --length short
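
The input handling in `transform` prefers the `--input` option and falls back to stdin only when something was piped. A small sketch of that rule with the stream passed in as a parameter, so it can be exercised without a real pipe. The `resolve_input` name and its `stream` parameter are assumptions for illustration, not part of the original code.

```python
import io
import sys
from typing import Optional, TextIO


def resolve_input(option_text: Optional[str], stream: Optional[TextIO] = None) -> str:
    """Return the --input text if given, else piped text, else an empty string."""
    if option_text:
        return option_text
    if stream is None:
        stream = sys.stdin
    # isatty() is True for an interactive terminal, where read() would block
    # waiting for the user; only read when input was piped or redirected.
    if not stream.isatty():
        return stream.read()
    return ""


if __name__ == "__main__":
    piped = io.StringIO("text from a pipe")  # StringIO.isatty() returns False
    print(resolve_input(None, piped))
    print(resolve_input("text from --input", piped))
```

Injecting the stream is a design choice for testability; in the CLI itself the default of `sys.stdin` behaves the same as `read_stdin()`.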

References

Typer: https://typer.tiangolo.com/
Rich: https://rich.readthedocs.io/
Click: https://click.palletsprojects.com/
llm CLI by Simon Willison: https://llm.datasette.io/

Conclusion

LLM-powered CLI tools bring AI capabilities directly into developer workflows. The terminal is where developers spend much of their time, and a well-designed CLI tool fits into the processes they already use. Use Typer for argument parsing and Rich for formatted output. Support both interactive and pipe-friendly modes. Stream responses so that long generations give feedback as they arrive. Add configuration management so users can customize defaults. The patterns here (code explanation, commit message generation, text transformation) are starting points: any repetitive text task is a candidate for an LLM CLI tool. Build tools that solve your own pain points first, then share them with your team. The best CLI tools become muscle memory, simple and useful enough that you reach for them without thinking.
