Implement semantic search using text embeddings for more relevant results than keyword matching.
Code Snippet
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    """Generate an embedding for text using OpenAI."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Calculate cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Index documents
documents = ["Python is great for data science", "JavaScript powers the web"]
doc_embeddings = [get_embedding(doc) for doc in documents]

# Search
query = "machine learning programming"
query_embedding = get_embedding(query)

# Find the most similar document
similarities = [cosine_similarity(query_embedding, de) for de in doc_embeddings]
best_match_idx = np.argmax(similarities)
print(f"Best match: {documents[best_match_idx]}")
```
Why This Helps
- Finds semantically similar content, not just keyword matches
- Works across languages and synonyms
- Foundation for RAG applications
How to Test
- Test with synonyms and paraphrases
- Compare results to keyword search
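Before spending API calls, the similarity math itself can be sanity-checked offline with hand-built vectors: identical vectors should score 1.0, orthogonal vectors 0.0. A self-contained sketch (the function is reproduced from the snippet above so this runs standalone):

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Same formula as in the snippet, reproduced so this check runs standalone."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

assert abs(cosine_similarity([1, 0], [1, 0]) - 1.0) < 1e-9  # identical -> 1.0
assert abs(cosine_similarity([1, 0], [0, 1])) < 1e-9        # orthogonal -> 0.0
assert cosine_similarity([1, 2], [2, 4]) > 0.999            # parallel -> ~1.0
print("cosine_similarity checks passed")
```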
When to Use
Search functionality, recommendation systems, document similarity, RAG pipelines.
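For search and recommendation use cases you usually want the top-k matches rather than the single best one; with NumPy that is a small extension of the `argmax` step. A sketch using toy precomputed scores in place of real `cosine_similarity` output (no API calls):

```python
import numpy as np

documents = ["doc A", "doc B", "doc C"]
# Toy similarity scores standing in for real cosine_similarity output
similarities = np.array([0.12, 0.87, 0.45])

k = 2
top_k_idx = np.argsort(similarities)[::-1][:k]  # indices sorted by score, highest first
top_docs = [documents[i] for i in top_k_idx]
print(top_docs)  # -> ['doc B', 'doc C']
```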
Performance/Security Notes
Use a vector database (Pinecone, Weaviate) in production rather than in-memory lists, and cache embeddings so you don't pay to re-embed unchanged text. Each embedding call costs money and latency.
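A minimal in-memory cache keyed on a hash of the text could look like the sketch below. `cached_embedding` and the stub embedder are illustrative names, not part of any library; in practice you would persist the dict to disk or a key-value store:

```python
import hashlib

def cached_embedding(text: str, embed_fn, cache: dict) -> list[float]:
    """Return a cached embedding, calling embed_fn only on cache misses."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        cache[key] = embed_fn(text)
    return cache[key]

# Demo with a stub embedder that records its calls (no API needed)
calls = []
def fake_embed(text):
    calls.append(text)
    return [0.1, 0.2, 0.3]

cache = {}
cached_embedding("hello", fake_embed, cache)
cached_embedding("hello", fake_embed, cache)  # second call served from cache
print(len(calls))  # -> 1
```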
Try this tip in your next project and share your results in the comments!