Introduction: Google’s Gemini API represents a significant leap in multimodal AI capabilities. Launched in December 2023, Gemini models are natively multimodal, trained from the ground up to understand and generate text, images, audio, and video. With context windows up to 2 million tokens and native Google Search grounding, Gemini offers unique capabilities for building sophisticated AI applications. This guide covers everything from basic text generation to advanced multimodal workflows and function calling.

Capabilities and Features
The Google Gemini API provides powerful multimodal capabilities:
- Gemini 1.5 Pro: Most capable model with 2M token context window
- Gemini 1.5 Flash: Fast and cost-effective for high-volume tasks
- Native Multimodal: Process text, images, video, and audio in single requests
- 2M Token Context: Process entire codebases, hour-long videos, or thousands of documents
- Function Calling: Define custom tools for agentic applications
- Google Search Grounding: Ground responses in real-time search results
- Code Execution: Run Python code in sandboxed environment
- Structured Output: JSON mode for reliable structured responses (see the sketch after this list)
- Safety Settings: Configurable content filtering
- Caching: Context caching for cost optimization on repeated queries
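Two of these features, structured output and safety settings, take only a few lines to use. A minimal sketch (it assumes an API key is already configured, as covered under Getting Started below; the prompt and category choices are illustrative):
import google.generativeai as genai
from google.generativeai.types import HarmCategory, HarmBlockThreshold

model = genai.GenerativeModel("gemini-1.5-flash")

# JSON mode: ask for machine-readable output instead of prose
response = model.generate_content(
    "List three microservices patterns as JSON objects with 'name' and 'use_case' keys",
    generation_config=genai.GenerationConfig(response_mime_type="application/json"),
    # Safety settings: tighten or relax individual harm categories
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_ONLY_HIGH
    }
)
print(response.text)  # a JSON string you can pass to json.loads()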
Getting Started
Install the Google Generative AI SDK:
# Install the SDK
pip install google-generativeai
# Set your API key
export GOOGLE_API_KEY="your-api-key"
# Or configure in code
import google.generativeai as genai
genai.configure(api_key="your-api-key")
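To confirm the key works, list the models available to your account:
import google.generativeai as genai
genai.configure(api_key="your-api-key")

# Print every model that supports generateContent
for m in genai.list_models():
    if "generateContent" in m.supported_generation_methods:
        print(m.name)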
Basic Text Generation
Create your first Gemini application:
import google.generativeai as genai
genai.configure(api_key="your-api-key")
# Initialize the model
model = genai.GenerativeModel("gemini-1.5-pro")
# Simple generation
response = model.generate_content("Explain microservices architecture in simple terms")
print(response.text)
# With generation config
response = model.generate_content(
    "Write a Python function to calculate fibonacci numbers",
    generation_config=genai.GenerationConfig(
        temperature=0.2,
        top_p=0.8,
        top_k=40,
        max_output_tokens=1024
    )
)
print(response.text)
# Multi-turn chat
chat = model.start_chat(history=[])
response = chat.send_message("What are the key principles of clean code?")
print(response.text)
response = chat.send_message("Can you give me examples in Python?")
print(response.text)
# Access chat history
for message in chat.history:
    print(f"{message.role}: {message.parts[0].text[:100]}...")
Multimodal Processing
Process images, video, and audio with Gemini:
import google.generativeai as genai
from pathlib import Path
genai.configure(api_key="your-api-key")
model = genai.GenerativeModel("gemini-1.5-pro")
# Image analysis
image_path = Path("architecture_diagram.png")
image_data = image_path.read_bytes()
response = model.generate_content([
    "Analyze this architecture diagram and explain the data flow:",
    {"mime_type": "image/png", "data": image_data}
])
print(response.text)
# Multiple images comparison
image1 = Path("design_v1.png").read_bytes()
image2 = Path("design_v2.png").read_bytes()
response = model.generate_content([
    "Compare these two UI designs:",
    {"mime_type": "image/png", "data": image1},
    {"mime_type": "image/png", "data": image2},
    "Which design is better for user experience and why?"
])
print(response.text)
# Video analysis (upload first for large files)
video_file = genai.upload_file("product_demo.mp4")
# Wait for processing
import time
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)
response = model.generate_content([
    video_file,
    "Summarize this product demo video and list the key features shown"
])
print(response.text)
# Audio transcription and analysis
audio_file = genai.upload_file("meeting_recording.mp3")
response = model.generate_content([
    audio_file,
    "Transcribe this meeting and extract action items"
])
print(response.text)
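Files uploaded through the File API live server-side for a limited time (48 hours at the time of writing), so it is good practice to clean up explicitly; a minimal sketch:
# List files previously uploaded via the File API
for f in genai.list_files():
    print(f.name, f.display_name)

# Delete an upload once you are done with it
genai.delete_file(video_file.name)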
Function Calling
Build agentic applications with custom tools:
import google.generativeai as genai
genai.configure(api_key="your-api-key")
# Define tools
def get_weather(location: str, unit: str = "celsius") -> dict:
    """Get current weather for a location."""
    # Stub implementation returning canned data for the demo
    return {"location": location, "temperature": 22, "unit": unit, "condition": "sunny"}

def search_products(query: str, max_price: float = None) -> list:
    """Search product catalog."""
    return [
        {"name": f"{query} Pro", "price": 99.99},
        {"name": f"{query} Basic", "price": 49.99}
    ]

def send_email(to: str, subject: str, body: str) -> dict:
    """Send an email."""
    return {"status": "sent", "to": to, "subject": subject}
# Create model with tools
model = genai.GenerativeModel(
    "gemini-1.5-pro",
    tools=[get_weather, search_products, send_email]
)
chat = model.start_chat(enable_automatic_function_calling=True)
# The model will automatically call functions as needed
response = chat.send_message(
    "What's the weather in Tokyo? Also find me some laptop products under $100"
)
print(response.text)
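With automatic calling enabled, the tool round-trips happen behind the scenes, but you can still audit which functions were invoked by walking the chat history:
# Inspect the function calls recorded in the conversation
for content in chat.history:
    for part in content.parts:
        if part.function_call:
            print(f"{content.role} invoked {part.function_call.name}")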
# Manual function calling for more control
model_manual = genai.GenerativeModel(
    "gemini-1.5-pro",
    tools=[get_weather, search_products]
)
chat_manual = model_manual.start_chat()  # automatic function calling is off by default
response = chat_manual.send_message(
    "Check weather in London and search for headphones"
)
# Execute each requested function call and collect the results
function_responses = []
for part in response.parts:
    if fn := part.function_call:
        print(f"Function: {fn.name}")
        print(f"Args: {dict(fn.args)}")
        if fn.name == "get_weather":
            result = get_weather(**dict(fn.args))
        elif fn.name == "search_products":
            result = search_products(**dict(fn.args))
        function_responses.append(genai.protos.Part(
            function_response=genai.protos.FunctionResponse(
                name=fn.name,
                response={"result": result}
            )
        ))
# Send all results back in a single turn so the model can compose its final answer
response = chat_manual.send_message(
    genai.protos.Content(parts=function_responses)
)
print(response.text)
Google Search Grounding
Ground responses in live Google Search results and inspect the sources the model used:
import google.generativeai as genai
genai.configure(api_key="your-api-key")
# Enable Google Search grounding
model = genai.GenerativeModel(
    "gemini-1.5-pro",
    tools=[genai.protos.Tool(google_search_retrieval=genai.protos.GoogleSearchRetrieval())]
)
# Query with real-time search
response = model.generate_content(
    "What are the latest developments in quantum computing this week?"
)
print(response.text)
# Access grounding metadata
if response.candidates[0].grounding_metadata:
    for chunk in response.candidates[0].grounding_metadata.grounding_chunks:
        print(f"Source: {chunk.web.uri}")
Context Caching for Cost Optimization
Upload a large input once, cache it, and reuse it across multiple queries at a reduced input rate:
import google.generativeai as genai
from google.generativeai import caching
import datetime
genai.configure(api_key="your-api-key")
# Upload large document
document = genai.upload_file("large_codebase.txt")
# Create cache
# Create cache (note: cached content must meet a minimum size, 32,768 tokens for 1.5 Pro)
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",  # caching requires a pinned model version
    display_name="codebase-cache",
    contents=[document],
    ttl=datetime.timedelta(hours=1)
)
# Use cached content for multiple queries
model = genai.GenerativeModel.from_cached_content(cache)
# These queries use the cached context (much cheaper)
response1 = model.generate_content("What design patterns are used in this codebase?")
response2 = model.generate_content("Find potential security vulnerabilities")
response3 = model.generate_content("Suggest refactoring improvements")
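Caches can also be enumerated and their lifetime extended instead of being rebuilt. A minimal sketch using the same caching module (verify the exact signatures against the current SDK docs):
# List the caches that exist in this project
for c in caching.CachedContent.list():
    print(c.display_name, c.expire_time)

# Extend the TTL rather than re-uploading the document
cache.update(ttl=datetime.timedelta(hours=2))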
# Delete cache when done
cache.delete()
Benchmarks and Performance
Gemini model performance characteristics:
| Model | Input Cost | Output Cost | Context Window | Speed |
|---|---|---|---|---|
| Gemini 1.5 Pro | $1.25/M tokens | $5.00/M tokens | 2M tokens | ~60 tokens/s |
| Gemini 1.5 Flash | $0.075/M tokens | $0.30/M tokens | 1M tokens | ~150 tokens/s |
| Gemini 1.5 Pro (cached input) | $0.315/M tokens | $5.00/M tokens | 2M tokens | ~60 tokens/s |
Media inputs are metered in tokens rather than by the clock: video is tokenized at roughly 263 tokens per second of footage (about an hour of video per million tokens) and audio at roughly 32 tokens per second (the API accepts roughly 9.5 hours of audio per request).
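To make the pricing concrete, here is a back-of-the-envelope cost estimate using the Flash rates from the table (prices change, so treat the constants as illustrative):
# Rough per-request cost for Gemini 1.5 Flash at the rates above
INPUT_PER_M = 0.075   # dollars per million input tokens
OUTPUT_PER_M = 0.30   # dollars per million output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of a single request."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# A 100k-token prompt with a 2k-token answer costs about $0.0081
print(f"${estimate_cost(100_000, 2_000):.4f}")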
When to Use Gemini
Best suited for:
- Multimodal applications (text + images + video + audio)
- Long context processing (2M tokens for entire codebases)
- Applications requiring real-time search grounding
- Video and audio analysis at scale
- Cost-sensitive applications (Flash model is very affordable)
- Google Cloud ecosystem integration
Consider alternatives when:
- Need image generation (use DALL-E or Imagen)
- Require fine-tuning (limited options currently)
- Building with Azure ecosystem (use Azure OpenAI)
- Need specialized coding models (consider Claude or GPT-4)
References and Documentation
- Official Documentation: https://ai.google.dev/docs
- API Reference: https://ai.google.dev/api
- Python SDK: https://github.com/google/generative-ai-python
- Cookbook: https://github.com/google-gemini/cookbook
- AI Studio: https://aistudio.google.com/
Conclusion
Google’s Gemini API stands out for its native multimodal capabilities and industry-leading context window. The ability to process hour-long videos, analyze audio recordings, and work with 2 million tokens of context opens possibilities that other models simply cannot match. Combined with Google Search grounding for real-time information and context caching for cost optimization, Gemini offers a compelling platform for building sophisticated AI applications. For teams working with multimodal data or requiring massive context windows, Gemini represents the current state of the art.