LlamaIndex: The Data Framework for Building Production RAG Applications

Introduction: LlamaIndex (formerly GPT Index) is the leading data framework for building LLM applications over your private data. While LangChain focuses on chains and agents, LlamaIndex specializes in data ingestion, indexing, and retrieval—the core components of Retrieval Augmented Generation (RAG). With over 160 data connectors through LlamaHub, sophisticated indexing strategies, and production-ready query engines, LlamaIndex has become the go-to choice for building knowledge-intensive AI applications. This guide covers everything from basic RAG to advanced techniques like agentic RAG and knowledge graphs.

[Figure: LlamaIndex Architecture - The Data Framework for LLM Applications]

Capabilities and Features

LlamaIndex provides comprehensive capabilities for building RAG applications:

  • Data Connectors: 160+ connectors via LlamaHub (databases, APIs, file formats)
  • Index Types: Vector, Summary, Tree, Keyword, and Knowledge Graph indexes
  • Query Engines: Sophisticated retrieval with re-ranking and post-processing
  • Chat Engines: Conversational interfaces with memory and context
  • Agents: ReAct agents with tool use and query planning
  • Evaluation: Built-in metrics for RAG quality assessment
  • Observability: Integration with LlamaTrace and other platforms
  • Structured Output: Pydantic integration for type-safe responses (a quick sketch follows this list)
  • Multi-Modal: Support for images, audio, and video in indexes
  • Production Ready: LlamaCloud for managed RAG infrastructure
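
The structured-output capability is worth a quick illustration, since it is not shown in the walkthroughs below. This is a minimal sketch, assuming an index like the one built in "Building Your First RAG Application" and a hypothetical Invoice model; the output_cls parameter of as_query_engine is the documented hook for Pydantic responses.

from pydantic import BaseModel

class Invoice(BaseModel):
    """Hypothetical schema for a typed query response."""
    vendor: str
    total_amount: float
    due_date: str

# `index` is assumed to be a VectorStoreIndex built as shown later in this guide
structured_engine = index.as_query_engine(
    output_cls=Invoice,
    response_mode="compact"
)

result = structured_engine.query("Extract the vendor, total amount, and due date.")
print(result.response)  # an Invoice instance rather than free-form text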

Getting Started

Install LlamaIndex and set up your environment:

# Install core package
pip install llama-index

# Install specific integrations
pip install llama-index-llms-openai llama-index-embeddings-openai
pip install llama-index-vector-stores-chroma
pip install llama-index-readers-file

# Set environment variables
export OPENAI_API_KEY="your-api-key"

Building Your First RAG Application

Create a simple document Q&A system:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure global settings
Settings.llm = OpenAI(model="gpt-4o", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Load documents from a directory
documents = SimpleDirectoryReader("./data").load_data()
print(f"Loaded {len(documents)} documents")

# Create vector index
index = VectorStoreIndex.from_documents(documents)

# Create query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="tree_summarize"
)

# Query the index
response = query_engine.query("What are the main topics covered in these documents?")
print(response)

# Access source nodes
for node in response.source_nodes:
    print(f"Score: {node.score:.3f}")
    print(f"Text: {node.text[:200]}...")
    print("---")

Advanced Indexing Strategies

LlamaIndex offers multiple index types for different use cases:

from llama_index.core import (
    VectorStoreIndex,
    SummaryIndex,
    TreeIndex,
    KeywordTableIndex,
    KnowledgeGraphIndex
)
from llama_index.core.node_parser import SentenceSplitter

# Custom node parsing
node_parser = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separator=" "
)

nodes = node_parser.get_nodes_from_documents(documents)

# Vector Store Index - Best for semantic similarity search
vector_index = VectorStoreIndex(nodes)

# Summary Index - Best for summarization tasks
summary_index = SummaryIndex(nodes)

# Tree Index - Best for hierarchical summarization
tree_index = TreeIndex(nodes, num_children=5)

# Keyword Table Index - Best for keyword-based retrieval
keyword_index = KeywordTableIndex(nodes)

# Knowledge Graph Index - Best for entity relationships
kg_index = KnowledgeGraphIndex(
    nodes,
    max_triplets_per_chunk=5,
    include_embeddings=True
)

# Combine indexes with routers
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

tools = [
    QueryEngineTool.from_defaults(
        query_engine=vector_index.as_query_engine(),
        description="Best for specific factual questions"
    ),
    QueryEngineTool.from_defaults(
        query_engine=summary_index.as_query_engine(),
        description="Best for summarization and overview questions"
    ),
    QueryEngineTool.from_defaults(
        query_engine=kg_index.as_query_engine(),
        description="Best for questions about relationships between entities"
    )
]

router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=tools
)

# The router automatically selects the best index
response = router_engine.query("Summarize the key findings")
print(response)

Production RAG with Vector Stores

Connect to production vector databases:

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
# Pinecone and Qdrant each need their own integration package:
# pip install llama-index-vector-stores-pinecone llama-index-vector-stores-qdrant
from llama_index.vector_stores.pinecone import PineconeVectorStore
from llama_index.vector_stores.qdrant import QdrantVectorStore
import chromadb

# Chroma (local or server)
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("documents")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context
)

# Pinecone (cloud)
from pinecone import Pinecone
pc = Pinecone(api_key="your-pinecone-key")
pinecone_index = pc.Index("your-index-name")

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context
)

# Qdrant (local or cloud)
from qdrant_client import QdrantClient
qdrant_client = QdrantClient(url="http://localhost:6333")

vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name="documents"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context
)
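
Note that from_documents re-embeds and re-ingests everything each time it runs. In production you typically ingest once and then reconnect to the already-populated store; a minimal sketch, using whichever vector_store was configured above:

# Reconnect to an existing, already-populated vector store (no re-ingestion)
existing_index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

query_engine = existing_index.as_query_engine(similarity_top_k=5)
print(query_engine.query("What changed in the latest release?"))  # example query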

Chat Engine with Memory

from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.chat_engine import CondensePlusContextChatEngine

# Create chat engine with memory
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    memory=memory,
    system_prompt="""You are a helpful assistant that answers questions 
    based on the provided documentation. Always cite your sources."""
)

# Interactive chat
response = chat_engine.chat("What is LlamaIndex?")
print(response)

response = chat_engine.chat("How does it compare to LangChain?")
print(response)

# Access chat history
for message in chat_engine.chat_history:
    print(f"{message.role}: {message.content[:100]}...")

Agentic RAG with Query Planning

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, FunctionTool

# Create tools from indexes
query_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name="document_search",
    description="Search through company documentation"
)

# Add custom tools
def get_current_date() -> str:
    """Get the current date."""
    from datetime import datetime
    return datetime.now().strftime("%Y-%m-%d")

def calculate(expression: str) -> str:
    """Evaluate a mathematical expression (demo only: eval is unsafe on untrusted input)."""
    try:
        return str(eval(expression))
    except Exception:
        return "Error evaluating expression"

date_tool = FunctionTool.from_defaults(fn=get_current_date)
calc_tool = FunctionTool.from_defaults(fn=calculate)

# Create ReAct agent
agent = ReActAgent.from_tools(
    tools=[query_tool, date_tool, calc_tool],
    llm=Settings.llm,
    verbose=True
)

# Agent can reason and use tools
response = agent.chat(
    "What were our Q3 revenue numbers and what's the growth rate compared to Q2?"
)
print(response)

# Sub-question query engine for complex queries
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

# Multiple data sources (sales_index and product_index are assumed to be
# separate indexes built from their own document sets)
sales_tool = QueryEngineTool.from_defaults(
    query_engine=sales_index.as_query_engine(),
    description="Sales data and metrics"
)
product_tool = QueryEngineTool.from_defaults(
    query_engine=product_index.as_query_engine(),
    description="Product information and features"
)

sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[sales_tool, product_tool]
)

# Automatically breaks down complex queries
response = sub_question_engine.query(
    "Compare our top-selling product features with sales performance"
)
print(response)

Evaluation and Observability

from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator
)

# Create evaluators (they use Settings.llm by default; CorrectnessEvaluator
# is also available but needs a reference answer to compare against)
faithfulness_evaluator = FaithfulnessEvaluator()
relevancy_evaluator = RelevancyEvaluator()

# Evaluate a response
query = "What are the key features of the product?"
response = query_engine.query(query)

# Check faithfulness (is response grounded in sources?)
faithfulness_result = faithfulness_evaluator.evaluate_response(
    query=query,
    response=response
)
print(f"Faithfulness: {faithfulness_result.passing}")

# Check relevancy (is response relevant to query?)
relevancy_result = relevancy_evaluator.evaluate_response(
    query=query,
    response=response
)
print(f"Relevancy: {relevancy_result.passing}")

# Batch evaluation
from llama_index.core.evaluation import BatchEvalRunner

eval_questions = [
    "What is the pricing model?",
    "How do I get started?",
    "What integrations are available?"
]

runner = BatchEvalRunner(
    evaluators={
        "faithfulness": faithfulness_evaluator,
        "relevancy": relevancy_evaluator
    },
    workers=4
)

# aevaluate_queries is async: `await` works as-is in a notebook; in a plain
# script, run it inside an async function via asyncio.run()
eval_results = await runner.aevaluate_queries(
    query_engine,
    queries=eval_questions
)
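
aevaluate_queries returns a dict keyed by evaluator name, with one EvaluationResult per query, so pass rates are easy to summarize. A short follow-up sketch:

# Summarize pass rates per evaluator
for name, results in eval_results.items():
    passed = sum(1 for r in results if r.passing)
    print(f"{name}: {passed}/{len(results)} passing")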

Benchmarks and Performance

LlamaIndex performance characteristics:

Operation                     Latency (p50)   Latency (p99)   Notes
Document Ingestion            50 ms/doc       200 ms/doc      Depends on size
Vector Query (top-5)          100 ms          300 ms          + embedding time
Query + Synthesis             2-4 s           8 s             GPT-4o
Chat Engine Turn              2-5 s           10 s            With context
Agent (3 steps)               8-15 s          30 s            ReAct agent
Sub-Question (3 sub-q)        10-20 s         45 s            Parallel execution

When to Use LlamaIndex

Best suited for:

  • Building RAG applications over private data
  • Document Q&A and knowledge base systems
  • Complex data ingestion from multiple sources
  • Applications requiring sophisticated retrieval strategies
  • Knowledge graph construction and querying
  • Production RAG with evaluation and observability

Consider alternatives when:

  • Building complex agent workflows (use LangGraph)
  • Need simple chains without RAG (use LangChain)
  • Require fully managed RAG infrastructure (use Amazon Bedrock Knowledge Bases or LlamaCloud)
  • Building multi-agent systems (use CrewAI or AutoGen)

Conclusion

LlamaIndex has established itself as the premier framework for building RAG applications. Its focus on data ingestion, indexing, and retrieval provides capabilities that complement rather than compete with agent-focused frameworks like LangChain. The extensive data connector ecosystem through LlamaHub, combined with sophisticated index types and query engines, makes it possible to build production-grade knowledge systems with minimal boilerplate. For teams building document Q&A, knowledge bases, or any application that needs to reason over private data, LlamaIndex offers the most comprehensive and well-designed toolkit available.

