In Part 1, we covered what Generative AI is and why it matters. Now let’s go deeper into the technology that powers most of it: Large Language Models.
You don’t need to understand every mathematical detail of transformers to build great applications. But understanding the key concepts helps you write better prompts, debug weird behavior, and make smarter architectural decisions.
Series Navigation: Part 1: GenAI Intro → Part 2: LLMs Deep Dive (You are here) → Part 3: Frameworks → Part 4: Agentic AI → Part 5: Building Agents → Part 6: Enterprise
How LLMs Actually Work (The Intuitive Version)
At its core, an LLM is a sophisticated autocomplete. Given a sequence of tokens, it predicts the most likely next token. Then it adds that token to the sequence and predicts again. Repeat until done.
That’s it. All the apparent “intelligence” emerges from doing this prediction really, really well over a massive amount of training data.
Input: "The capital of France is"
Model predicts: " Paris" (highest probability)
New sequence: "The capital of France is Paris"
Model predicts: "." (period is likely after a complete statement)
Output: "The capital of France is Paris."
The Transformer Architecture (Conceptually)
Transformers, introduced in the 2017 paper “Attention Is All You Need,” revolutionized NLP with one key innovation: self-attention. Unlike previous architectures that processed text sequentially, transformers look at all parts of the input simultaneously and learn which parts are relevant to each other.
When processing “The cat sat on the mat because it was tired”:
- The model learns that “it” refers to “cat” (not “mat”)
- It understands “tired” relates to a living thing, reinforcing the cat connection
- All these relationships are computed in parallel
This parallelization is why transformers can scale to trillions of parameters and why GPUs are so important—they’re great at parallel computation.
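To make “attention” a bit more concrete, here is a toy sketch of scaled dot-product attention, the core operation inside a transformer layer (single head, no masking, random vectors standing in for learned projections; real models add far more machinery around this).
# Toy scaled dot-product attention (single head, no masking), illustrative only
import numpy as np
def attention(Q, K, V):
    """Each output row is a weighted mix of V's rows; the weights come from Q-K similarity."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how relevant each token is to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # 5 tokens, 8-dimensional embeddings
print(attention(x, x, x).shape)  # (5, 8): every token attends to every other token in parallel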
Prompting: The Art of Talking to Models
Prompting is how you communicate with LLMs. It’s deceptively simple—just text—but getting good results requires understanding how models interpret instructions.
Basic Prompting Patterns
# 1. Zero-shot: Just ask directly
prompt = "Translate to French: Hello, how are you?"
# 2. Few-shot: Provide examples
prompt = """Translate to French:
English: Good morning
French: Bonjour
English: Thank you
French: Merci
English: Hello, how are you?
French:"""
# 3. Chain-of-Thought: Ask for reasoning
prompt = """Solve this step by step:
If a train travels at 60 mph for 2.5 hours, how far does it go?
Let's think through this:"""
# 4. Role-based: Set context and persona
prompt = """You are an expert Python developer with 15 years of experience.
Review this code for bugs and security issues:
def login(username, password):
query = f"SELECT * FROM users WHERE name='{username}' AND pass='{password}'"
return db.execute(query)
"""
Advanced Prompting Techniques
# Structured Output Prompting
from openai import OpenAI
import json
client = OpenAI()
def extract_entities(text: str) -> dict:
    """Extract structured data from unstructured text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """Extract entities from the text and return as JSON.
Format: {"people": [], "organizations": [], "locations": [], "dates": []}
Only return valid JSON, no other text."""
            },
            {"role": "user", "content": text}
        ],
        temperature=0,
        response_format={"type": "json_object"}  # Enforce JSON output
    )
    return json.loads(response.choices[0].message.content)
# Example
text = """
Apple CEO Tim Cook announced yesterday that the company will open
a new research facility in Munich, Germany by March 2026.
The $1 billion investment will create 2,000 jobs.
"""
entities = extract_entities(text)
print(json.dumps(entities, indent=2))
# Output:
# {
# "people": ["Tim Cook"],
# "organizations": ["Apple"],
# "locations": ["Munich", "Germany"],
# "dates": ["yesterday", "March 2026"]
# }
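Because the model’s output is just text, it’s worth validating the parsed JSON before anything downstream relies on it. Here is a small sketch using Pydantic; the Entities class simply mirrors the format requested in the system prompt above.
# Validating the extracted entities with Pydantic (illustrative sketch)
from pydantic import BaseModel, Field
class Entities(BaseModel):
    people: list[str] = Field(default_factory=list)
    organizations: list[str] = Field(default_factory=list)
    locations: list[str] = Field(default_factory=list)
    dates: list[str] = Field(default_factory=list)
validated = Entities.model_validate(entities)  # raises ValidationError on a malformed shape
print(validated.people)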
The System Prompt: Setting the Stage
The system prompt is your most powerful tool for controlling model behavior. It sets the context, constraints, and persona for the entire conversation.
# Well-structured system prompt
system_prompt = """You are a senior software architect helping developers design systems.
## Your Expertise
- Distributed systems and microservices
- Cloud platforms (AWS, Azure, GCP)
- Database design and optimization
- Security best practices
## Response Guidelines
1. Always consider scalability, reliability, and cost
2. Provide specific technology recommendations with rationale
3. Mention potential trade-offs and alternatives
4. Use diagrams in ASCII when helpful
5. Ask clarifying questions if requirements are unclear
## Constraints
- Do not recommend deprecated technologies
- Always consider security implications
- Prefer battle-tested solutions over cutting-edge unless specifically asked
"""
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "How should I design a notification system that needs to handle 10M users?"}
    ]
)
Model Capabilities Compared (August 2025)
Not all models are equal. Here’s my practical assessment based on production use:
| Capability | GPT-4o | Claude 4 Sonnet | Gemini 2.5 Pro | Llama 4 70B |
|---|---|---|---|---|
| Complex Reasoning | Excellent | Excellent | Excellent | Very Good |
| Code Generation | Excellent | Excellent | Excellent | Very Good |
| Long Documents | Good (128K) | Excellent (500K) | Excellent (2M) | Good (128K) |
| Following Instructions | Excellent | Excellent | Very Good | Very Good |
| Multimodal (Images) | Excellent | Excellent | Excellent | Good |
| Speed | Fast | Fast | Very Fast | Depends on infra |
| Cost (relative) | $$$ | $$ | $$ | $ (self-hosted) |
Working with Long Context
One of the biggest advances in 2024-2025 has been context window expansion. Here’s how to use it effectively:
# Processing a long document with Claude 4
import anthropic
client = anthropic.Anthropic()
def analyze_codebase(files: dict[str, str], question: str) -> str:
    """Analyze an entire codebase using long context."""
    # Combine all files with clear separators
    codebase_text = ""
    for filepath, content in files.items():
        codebase_text += f"\n\n=== FILE: {filepath} ===\n{content}"
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"""Analyze this codebase and answer the question.
CODEBASE:
{codebase_text}
QUESTION: {question}
Provide specific file references and line numbers where relevant."""
            }
        ]
    )
    return response.content[0].text
# Example: Analyze a project
files = {
    "src/main.py": open("src/main.py").read(),
    "src/models.py": open("src/models.py").read(),
    "src/api.py": open("src/api.py").read(),
    # ... include all relevant files
}
analysis = analyze_codebase(files, "What are the potential security vulnerabilities?")
print(analysis)
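Long context is not infinite, so it pays to estimate how much you’re about to send. A rough sketch using tiktoken’s cl100k_base encoding as an approximation (Claude uses its own tokenizer, so treat the number as a ballpark rather than an exact count):
# Rough token estimate before sending a large context (approximation only)
import tiktoken
def estimate_tokens(text: str) -> int:
    """Approximate token count; Claude's tokenizer differs, so treat this as a ballpark."""
    return len(tiktoken.get_encoding("cl100k_base").encode(text))
codebase_text = "".join(f"\n\n=== FILE: {path} ===\n{content}" for path, content in files.items())
print(f"~{estimate_tokens(codebase_text):,} tokens")  # sanity-check against the model's context window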
Handling Hallucinations
LLMs hallucinate. They generate plausible-sounding but incorrect information. This is a fundamental characteristic, and while models have improved, it hasn’t been eliminated. Here’s how to mitigate it:
Hallucination Mitigation Strategies
- RAG (Retrieval-Augmented Generation): Give the model actual documents to reference. We’ll cover this in Part 3.
- Temperature = 0: Removes sampling randomness for factual tasks; it makes outputs more consistent, though it lowers rather than eliminates hallucinations.
- Explicit grounding: Tell the model to only use provided information and to say “I don’t know” otherwise.
- Citation requirements: Ask the model to cite sources, then verify them.
- Multi-model verification: Run the same query through multiple models and compare (a rough sketch follows the grounded response pattern below).
# Grounded response pattern
def grounded_answer(context: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """Answer based ONLY on the provided context.
If the answer is not in the context, say "I cannot find this information in the provided context."
Always quote the relevant passage that supports your answer."""
            },
            {
                "role": "user",
                "content": f"Context: {context}\n\nQuestion: {question}"
            }
        ],
        temperature=0
    )
    return response.choices[0].message.content
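Multi-model verification from the list above can start as simply as asking two providers the same question and flagging disagreement for review. A rough sketch (it assumes both the OpenAI and Anthropic SDKs are configured; the naive string comparison is only a stand-in for a real semantic similarity check):
# Naive multi-model verification (string comparison is a stand-in for a real similarity check)
import anthropic
from openai import OpenAI
openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()
def cross_check(question: str) -> dict:
    """Ask GPT-4o and Claude the same question and flag disagreement for human review."""
    gpt_answer = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        temperature=0
    ).choices[0].message.content
    claude_answer = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": question}]
    ).content[0].text
    agree = gpt_answer.strip().lower() == claude_answer.strip().lower()
    return {"gpt": gpt_answer, "claude": claude_answer, "agree": agree}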
Streaming Responses
For better UX, stream responses instead of waiting for completion:
# Streaming for real-time output
from openai import OpenAI
client = OpenAI()
def stream_response(prompt: str):
    """Stream response tokens as they're generated."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()  # Newline at end
# Usage
stream_response("Explain quantum computing in simple terms.")
# Output appears word by word as it's generated
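In practice you often want both the live output and the complete text afterwards (for logging, caching, or follow-up calls). A small variation on the function above collects the chunks as they stream:
def stream_and_collect(prompt: str) -> str:
    """Stream tokens to the console and return the full response at the end."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            print(delta, end="", flush=True)
    print()
    return "".join(parts)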
Cost Optimization
LLM costs add up fast. Here are strategies I use:
| Strategy | Implementation | Savings |
|---|---|---|
| Model tiering | Use GPT-4o-mini/Claude Haiku for simple tasks | 10-20x cheaper |
| Caching | Cache identical queries with Redis or similar (sketch below) | Variable, can be huge |
| Prompt caching | Use Anthropic/OpenAI prompt caching features | Up to 90% on long prompts |
| Batching | Process multiple items in one call | Reduces overhead |
| Open source | Self-host Llama 4/Mistral for high volume | 90%+ at scale |
# Model tiering example
def smart_route(task_complexity: str, prompt: str) -> str:
    """Route to appropriate model based on task complexity."""
    if task_complexity == "simple":
        # Classification, simple extraction, formatting
        model = "gpt-4o-mini"
    elif task_complexity == "medium":
        # Standard generation, summarization
        model = "gpt-4o"
    else:
        # Complex reasoning, code generation, analysis
        # (routing to Claude instead would require the Anthropic client rather than this OpenAI one)
        model = "gpt-4o"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
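The caching row in the table above can start as something as simple as an in-memory dictionary keyed by model and prompt; swap the dict for Redis or similar once you need persistence across processes. A minimal sketch:
# Minimal response cache keyed by (model, prompt); swap the dict for Redis in production
import hashlib
_cache: dict[str, str] = {}
def cached_completion(model: str, prompt: str) -> str:
    """Return a cached answer for identical (model, prompt) pairs, calling the API only on a miss."""
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key not in _cache:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0  # deterministic-leaning output makes cache hits more meaningful
        )
        _cache[key] = response.choices[0].message.content
    return _cache[key]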
Key Takeaways
- LLMs are sophisticated next-token predictors—understanding this helps you work with their limitations
- System prompts are powerful—invest time in crafting them well
- Different models excel at different tasks—choose based on your specific needs
- Hallucinations are inherent—build verification into your applications
- Cost optimization matters—use model tiering, caching, and prompt caching strategically
What’s Next
In Part 3, we’ll explore the frameworks that make building LLM applications easier: LangChain, LlamaIndex, Semantic Kernel, and others. We’ll build practical RAG systems and learn when to use each framework.
References & Further Reading
- Anthropic’s Prompt Engineering Guide – docs.anthropic.com
- OpenAI Prompt Engineering Guide – platform.openai.com
- The Illustrated Transformer – jalammar.github.io – Best visual explanation
- Andrej Karpathy’s LLM Intro – YouTube – Excellent overview
- LLM Tokenizer Playground – OpenAI Tokenizer
Got prompting tips or cost optimization strategies? Share them on GitHub or in the comments.