Google Gemini API: Building Multimodal AI Applications with 2M Token Context

Introduction: Google’s Gemini API represents a significant leap in multimodal AI capabilities. Launched in December 2023, Gemini models are natively multimodal, trained from the ground up to understand and generate text, images, audio, and video. With context windows up to 2 million tokens and native Google Search grounding, Gemini offers unique capabilities for building sophisticated AI applications. This guide covers everything from basic text generation to advanced multimodal workflows and function calling.

[Figure: Google Gemini API architecture. Caption: Google Gemini: Natively Multimodal AI with 2M Token Context]

Capabilities and Features

The Google Gemini API provides powerful multimodal capabilities:

  • Gemini 1.5 Pro: Most capable model with 2M token context window
  • Gemini 1.5 Flash: Fast and cost-effective for high-volume tasks
  • Native Multimodal: Process text, images, video, and audio in single requests
  • 2M Token Context: Process entire codebases, hour-long videos, or thousands of documents
  • Function Calling: Define custom tools for agentic applications
  • Google Search Grounding: Ground responses in real-time search results
  • Code Execution: Run Python code in sandboxed environment
  • Structured Output: JSON mode for reliable structured responses
  • Safety Settings: Configurable content filtering
  • Caching: Context caching for cost optimization on repeated queries

Getting Started

Install the Google Generative AI SDK:

# Install the SDK
pip install google-generativeai

# Set your API key
export GOOGLE_API_KEY="your-api-key"

# Or configure in code
import google.generativeai as genai
genai.configure(api_key="your-api-key")

Basic Text Generation

Create your first Gemini application:

import google.generativeai as genai

genai.configure(api_key="your-api-key")

# Initialize the model
model = genai.GenerativeModel("gemini-1.5-pro")

# Simple generation
response = model.generate_content("Explain microservices architecture in simple terms")
print(response.text)

# With generation config
response = model.generate_content(
    "Write a Python function to calculate fibonacci numbers",
    generation_config=genai.GenerationConfig(
        temperature=0.2,
        top_p=0.8,
        top_k=40,
        max_output_tokens=1024
    )
)
print(response.text)

# Multi-turn chat
chat = model.start_chat(history=[])

response = chat.send_message("What are the key principles of clean code?")
print(response.text)

response = chat.send_message("Can you give me examples in Python?")
print(response.text)

# Access chat history
for message in chat.history:
    print(f"{message.role}: {message.parts[0].text[:100]}...")

Multimodal Processing

Process images, video, and audio with Gemini:

import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="your-api-key")
model = genai.GenerativeModel("gemini-1.5-pro")

# Image analysis
image_path = Path("architecture_diagram.png")
image_data = image_path.read_bytes()

response = model.generate_content([
    "Analyze this architecture diagram and explain the data flow:",
    {"mime_type": "image/png", "data": image_data}
])
print(response.text)

# Multiple images comparison
image1 = Path("design_v1.png").read_bytes()
image2 = Path("design_v2.png").read_bytes()

response = model.generate_content([
    "Compare these two UI designs:",
    {"mime_type": "image/png", "data": image1},
    {"mime_type": "image/png", "data": image2},
    "Which design is better for user experience and why?"
])
print(response.text)

# Video analysis (upload first for large files)
video_file = genai.upload_file("product_demo.mp4")

# Poll until server-side processing finishes
import time
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise RuntimeError("Video processing failed")

response = model.generate_content([
    video_file,
    "Summarize this product demo video and list the key features shown"
])
print(response.text)

# Audio transcription and analysis
audio_file = genai.upload_file("meeting_recording.mp3")

response = model.generate_content([
    audio_file,
    "Transcribe this meeting and extract action items"
])
print(response.text)

Function Calling

Build agentic applications with custom tools:

import google.generativeai as genai

genai.configure(api_key="your-api-key")

# Define tools
def get_weather(location: str, unit: str = "celsius") -> dict:
    """Get current weather for a location."""
    return {"location": location, "temperature": 22, "unit": unit, "condition": "sunny"}

def search_products(query: str, max_price: float = None) -> list:
    """Search product catalog."""
    return [
        {"name": f"{query} Pro", "price": 99.99},
        {"name": f"{query} Basic", "price": 49.99}
    ]

def send_email(to: str, subject: str, body: str) -> dict:
    """Send an email."""
    return {"status": "sent", "to": to, "subject": subject}

# Create model with tools
model = genai.GenerativeModel(
    "gemini-1.5-pro",
    tools=[get_weather, search_products, send_email]
)

chat = model.start_chat(enable_automatic_function_calling=True)

# The model will automatically call functions as needed
response = chat.send_message(
    "What's the weather in Tokyo? Also find me some laptop products under $100"
)
print(response.text)

# Manual function calling for more control
model_manual = genai.GenerativeModel(
    "gemini-1.5-pro",
    tools=[get_weather, search_products]
)

# Automatic calling is off by default, so the model returns
# function_call parts instead of executing anything itself
chat_manual = model_manual.start_chat()
response = chat_manual.send_message(
    "Check weather in London and search for headphones"
)

# Execute each requested function and collect the results
result_parts = []
for part in response.parts:
    if fn := part.function_call:
        print(f"Function: {fn.name}")
        print(f"Args: {dict(fn.args)}")

        if fn.name == "get_weather":
            result = get_weather(**dict(fn.args))
        elif fn.name == "search_products":
            result = search_products(**dict(fn.args))

        result_parts.append(genai.protos.Part(
            function_response=genai.protos.FunctionResponse(
                name=fn.name,
                response={"result": result}
            )
        ))

# Return all results in one turn so the model can compose its answer
if result_parts:
    response = chat_manual.send_message(
        genai.protos.Content(parts=result_parts)
    )
    print(response.text)
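
The capabilities list also mentions sandboxed code execution. In this SDK it is enabled like any other tool, here via the string shorthand (a minimal sketch; the exact shape of the returned parts varies by SDK version):

import google.generativeai as genai

genai.configure(api_key="your-api-key")

# Enable the built-in sandboxed Python interpreter
model = genai.GenerativeModel("gemini-1.5-pro", tools="code_execution")

response = model.generate_content(
    "Compute the sum of the first 50 prime numbers by running Python code"
)
print(response.text)  # includes the generated code and its execution output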

Google Search Grounding

import google.generativeai as genai

genai.configure(api_key="your-api-key")

# Enable Google Search grounding
model = genai.GenerativeModel(
    "gemini-1.5-pro",
    tools=[genai.protos.Tool(google_search_retrieval=genai.protos.GoogleSearchRetrieval())]
)

# Query with real-time search
response = model.generate_content(
    "What are the latest developments in quantum computing this week?"
)

print(response.text)

# Access grounding metadata
if response.candidates[0].grounding_metadata:
    for chunk in response.candidates[0].grounding_metadata.grounding_chunks:
        print(f"Source: {chunk.web.uri}")

Context Caching for Cost Optimization

import google.generativeai as genai
from google.generativeai import caching
import datetime

genai.configure(api_key="your-api-key")

# Upload large document
document = genai.upload_file("large_codebase.txt")

# Create cache (caching requires a version-pinned model such as -001,
# and the cached content must meet a minimum token count: 32,768 for 1.5 Pro)
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",
    display_name="codebase-cache",
    contents=[document],
    ttl=datetime.timedelta(hours=1)
)

# Use cached content for multiple queries
model = genai.GenerativeModel.from_cached_content(cache)

# These queries use the cached context (much cheaper)
response1 = model.generate_content("What design patterns are used in this codebase?")
response2 = model.generate_content("Find potential security vulnerabilities")
response3 = model.generate_content("Suggest refactoring improvements")

# Delete cache when done
cache.delete()

Benchmarks and Performance

Gemini model performance characteristics:

Model              Input Cost        Output Cost      Context Window   Speed
Gemini 1.5 Pro     $1.25/M tokens    $5.00/M tokens   2M tokens        ~60 tokens/s
Gemini 1.5 Flash   $0.075/M tokens   $0.30/M tokens   1M tokens        ~150 tokens/s
Cached content     $0.315/M tokens   Same as above    2M tokens        Same as above

Media inputs are tokenized by duration rather than priced separately: video consumes roughly 263 tokens per second (about an hour per 1M tokens of context), and audio roughly 32 tokens per second (about 9.5 hours per 1M tokens).
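
Because media is tokenized by duration, it is worth measuring a file's token footprint before building a prompt around it. A short sketch using count_tokens (the file name is illustrative):

import google.generativeai as genai

genai.configure(api_key="your-api-key")
model = genai.GenerativeModel("gemini-1.5-pro")

# Uploaded media counts against the context window by duration
video_file = genai.upload_file("product_demo.mp4")
token_count = model.count_tokens([video_file])
print(token_count.total_tokens)  # roughly 263 tokens per second of video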

When to Use Gemini

Best suited for:

  • Multimodal applications (text + images + video + audio)
  • Long context processing (2M tokens for entire codebases)
  • Applications requiring real-time search grounding
  • Video and audio analysis at scale
  • Cost-sensitive applications (Flash model is very affordable)
  • Google Cloud ecosystem integration

Consider alternatives when:

  • Need image generation (use DALL-E or Imagen)
  • Require fine-tuning (limited options currently)
  • Building with Azure ecosystem (use Azure OpenAI)
  • Need specialized coding models (consider Claude or GPT-4)

Conclusion

Google’s Gemini API stands out for its native multimodal capabilities and industry-leading context window. The ability to process hour-long videos, analyze audio recordings, and work with 2 million tokens of context opens possibilities that other models simply cannot match. Combined with Google Search grounding for real-time information and context caching for cost optimization, Gemini offers a compelling platform for building sophisticated AI applications. For teams working with multimodal data or requiring massive context windows, Gemini represents the current state of the art.

