Google Gemini API: Building Multimodal AI Applications with 2M Token Context

Introduction: Google’s Gemini API represents a significant leap in multimodal AI capabilities. Launched in December 2023, Gemini models are natively multimodal, trained from the ground up to understand and generate text, images, audio, and video. With context windows up to 2 million tokens and native Google Search grounding, Gemini offers unique capabilities for building sophisticated AI applications. This guide covers everything from basic text generation to advanced multimodal workflows and function calling.

[Figure: Google Gemini API architecture. Caption: Google Gemini: Natively Multimodal AI with 2M Token Context]

Capabilities and Features

The Google Gemini API provides powerful multimodal capabilities:

  • Gemini 1.5 Pro: Most capable model with 2M token context window
  • Gemini 1.5 Flash: Fast and cost-effective for high-volume tasks
  • Native Multimodal: Process text, images, video, and audio in single requests
  • 2M Token Context: Process entire codebases, hour-long videos, or thousands of documents
  • Function Calling: Define custom tools for agentic applications
  • Google Search Grounding: Ground responses in real-time search results
  • Code Execution: Run Python code in sandboxed environment
  • Structured Output: JSON mode for reliable structured responses
  • Safety Settings: Configurable content filtering
  • Caching: Context caching for cost optimization on repeated queries

Getting Started

Install the Google Generative AI SDK:

# Install the SDK
pip install google-generativeai

# Set your API key
export GOOGLE_API_KEY="your-api-key"

# Or configure in code
import google.generativeai as genai
genai.configure(api_key="your-api-key")

Basic Text Generation

Create your first Gemini application:

import google.generativeai as genai

genai.configure(api_key="your-api-key")

# Initialize the model
model = genai.GenerativeModel("gemini-1.5-pro")

# Simple generation
response = model.generate_content("Explain microservices architecture in simple terms")
print(response.text)

# With generation config
response = model.generate_content(
    "Write a Python function to calculate fibonacci numbers",
    generation_config=genai.GenerationConfig(
        temperature=0.2,
        top_p=0.8,
        top_k=40,
        max_output_tokens=1024
    )
)
print(response.text)

# Multi-turn chat
chat = model.start_chat(history=[])

response = chat.send_message("What are the key principles of clean code?")
print(response.text)

response = chat.send_message("Can you give me examples in Python?")
print(response.text)

# Access chat history
for message in chat.history:
    print(f"{message.role}: {message.parts[0].text[:100]}...")

Multimodal Processing

Process images, video, and audio with Gemini:

import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="your-api-key")
model = genai.GenerativeModel("gemini-1.5-pro")

# Image analysis
image_path = Path("architecture_diagram.png")
image_data = image_path.read_bytes()

response = model.generate_content([
    "Analyze this architecture diagram and explain the data flow:",
    {"mime_type": "image/png", "data": image_data}
])
print(response.text)

# Multiple images comparison
image1 = Path("design_v1.png").read_bytes()
image2 = Path("design_v2.png").read_bytes()

response = model.generate_content([
    "Compare these two UI designs:",
    {"mime_type": "image/png", "data": image1},
    {"mime_type": "image/png", "data": image2},
    "Which design is better for user experience and why?"
])
print(response.text)

# Video analysis (upload first for large files)
video_file = genai.upload_file("product_demo.mp4")

# Poll until server-side processing finishes
import time
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise RuntimeError("Video processing failed")

response = model.generate_content([
    video_file,
    "Summarize this product demo video and list the key features shown"
])
print(response.text)

# Audio transcription and analysis
audio_file = genai.upload_file("meeting_recording.mp3")

response = model.generate_content([
    audio_file,
    "Transcribe this meeting and extract action items"
])
print(response.text)

Function Calling

Build agentic applications with custom tools:

import google.generativeai as genai

genai.configure(api_key="your-api-key")

# Define tools
def get_weather(location: str, unit: str = "celsius") -> dict:
    """Get current weather for a location."""
    return {"location": location, "temperature": 22, "unit": unit, "condition": "sunny"}

def search_products(query: str, max_price: float = None) -> list:
    """Search product catalog."""
    return [
        {"name": f"{query} Pro", "price": 99.99},
        {"name": f"{query} Basic", "price": 49.99}
    ]

def send_email(to: str, subject: str, body: str) -> dict:
    """Send an email."""
    return {"status": "sent", "to": to, "subject": subject}

# Create model with tools
model = genai.GenerativeModel(
    "gemini-1.5-pro",
    tools=[get_weather, search_products, send_email]
)

chat = model.start_chat(enable_automatic_function_calling=True)

# The model will automatically call functions as needed
response = chat.send_message(
    "What's the weather in Tokyo? Also find me some laptop products under $100"
)
print(response.text)

# Manual function calling for more control
model_manual = genai.GenerativeModel(
    "gemini-1.5-pro",
    tools=[get_weather, search_products]
)

# Automatic calling is off by default, so the model returns
# function_call parts instead of executing anything itself
chat_manual = model_manual.start_chat()
response = chat_manual.send_message(
    "Check weather in London and search for headphones"
)

# Execute each requested function and collect the results
result_parts = []
for part in response.parts:
    if fn := part.function_call:
        print(f"Function: {fn.name}")
        print(f"Args: {dict(fn.args)}")

        if fn.name == "get_weather":
            result = get_weather(**dict(fn.args))
        elif fn.name == "search_products":
            result = search_products(**dict(fn.args))

        result_parts.append(genai.protos.Part(
            function_response=genai.protos.FunctionResponse(
                name=fn.name,
                response={"result": result}
            )
        ))

# Return all results in one turn so the model can compose its answer
if result_parts:
    response = chat_manual.send_message(
        genai.protos.Content(parts=result_parts)
    )
    print(response.text)
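
The capabilities list also mentions sandboxed code execution. In this SDK it is enabled like any other tool, here via the string shorthand (a minimal sketch; the exact shape of the returned parts varies by SDK version):

import google.generativeai as genai

genai.configure(api_key="your-api-key")

# Enable the built-in sandboxed Python interpreter
model = genai.GenerativeModel("gemini-1.5-pro", tools="code_execution")

response = model.generate_content(
    "Compute the sum of the first 50 prime numbers by running Python code"
)
print(response.text)  # includes the generated code and its execution output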

Google Search Grounding

import google.generativeai as genai

genai.configure(api_key="your-api-key")

# Enable Google Search grounding
model = genai.GenerativeModel(
    "gemini-1.5-pro",
    tools=[genai.protos.Tool(google_search_retrieval=genai.protos.GoogleSearchRetrieval())]
)

# Query with real-time search
response = model.generate_content(
    "What are the latest developments in quantum computing this week?"
)

print(response.text)

# Access grounding metadata
if response.candidates[0].grounding_metadata:
    for chunk in response.candidates[0].grounding_metadata.grounding_chunks:
        print(f"Source: {chunk.web.uri}")

Context Caching for Cost Optimization

import google.generativeai as genai
from google.generativeai import caching
import datetime

genai.configure(api_key="your-api-key")

# Upload large document
document = genai.upload_file("large_codebase.txt")

# Create cache (caching requires a version-pinned model such as -001,
# and the cached content must meet a minimum token count: 32,768 for 1.5 Pro)
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",
    display_name="codebase-cache",
    contents=[document],
    ttl=datetime.timedelta(hours=1)
)

# Use cached content for multiple queries
model = genai.GenerativeModel.from_cached_content(cache)

# These queries use the cached context (much cheaper)
response1 = model.generate_content("What design patterns are used in this codebase?")
response2 = model.generate_content("Find potential security vulnerabilities")
response3 = model.generate_content("Suggest refactoring improvements")

# Delete cache when done
cache.delete()

Benchmarks and Performance

Gemini model performance characteristics:

Model              Input Cost        Output Cost      Context Window   Speed
Gemini 1.5 Pro     $1.25/M tokens    $5.00/M tokens   2M tokens        ~60 tokens/s
Gemini 1.5 Flash   $0.075/M tokens   $0.30/M tokens   1M tokens        ~150 tokens/s
Cached content     $0.315/M tokens   Same as above    2M tokens        Same as above

Media inputs are tokenized by duration rather than priced separately: video consumes roughly 263 tokens per second (about an hour per 1M tokens of context), and audio roughly 32 tokens per second (about 9.5 hours per 1M tokens).
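
Because media is tokenized by duration, it is worth measuring a file's token footprint before building a prompt around it. A short sketch using count_tokens (the file name is illustrative):

import google.generativeai as genai

genai.configure(api_key="your-api-key")
model = genai.GenerativeModel("gemini-1.5-pro")

# Uploaded media counts against the context window by duration
video_file = genai.upload_file("product_demo.mp4")
token_count = model.count_tokens([video_file])
print(token_count.total_tokens)  # roughly 263 tokens per second of video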

When to Use Gemini

Best suited for:

  • Multimodal applications (text + images + video + audio)
  • Long context processing (2M tokens for entire codebases)
  • Applications requiring real-time search grounding
  • Video and audio analysis at scale
  • Cost-sensitive applications (Flash model is very affordable)
  • Google Cloud ecosystem integration

Consider alternatives when:

  • Need image generation (use DALL-E or Imagen)
  • Require fine-tuning (limited options currently)
  • Building with Azure ecosystem (use Azure OpenAI)
  • Need specialized coding models (consider Claude or GPT-4)

Conclusion

Google’s Gemini API stands out for its native multimodal capabilities and industry-leading context window. The ability to process hour-long videos, analyze audio recordings, and work with 2 million tokens of context opens possibilities that other models simply cannot match. Combined with Google Search grounding for real-time information and context caching for cost optimization, Gemini offers a compelling platform for building sophisticated AI applications. For teams working with multimodal data or requiring massive context windows, Gemini represents the current state of the art.

