Introduction
Fine-tuning transforms a general-purpose LLM into a specialized model for your specific use case. While prompt engineering works for many applications, fine-tuning offers advantages when you need consistent formatting, domain-specific knowledge, or reduced latency from shorter prompts. This guide covers practical fine-tuning: when to fine-tune versus when prompt engineering is enough, preparing training data, running fine-tuning jobs with OpenAI and open-source models, evaluating results, and deploying fine-tuned models in production. Understanding these fundamentals helps you decide when fine-tuning is worth the investment and how to execute it effectively.

When to Fine-tune
Fine-tuning makes sense in specific scenarios. Consider fine-tuning when you need consistent output formatting that’s hard to achieve with prompts alone—like always returning valid JSON in a specific schema. Fine-tuning helps when you have domain-specific terminology or writing style that the base model doesn’t capture well. It’s valuable when you want to reduce prompt length (and thus cost and latency) by baking instructions into the model weights. However, start with prompt engineering first. Many tasks that seem to need fine-tuning can be solved with better prompts, few-shot examples, or retrieval augmentation. Fine-tuning requires quality training data, costs money, and creates a model you need to maintain. Only fine-tune when you’ve exhausted prompt-based approaches and have clear evidence that fine-tuning will help.
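To make the formatting case concrete, here is what a single training example in OpenAI's chat fine-tuning format (covered below) might look like when the goal is a strict JSON response. The schema, wording, and values are hypothetical, and the example is pretty-printed here; in the actual JSONL training file each example occupies one line:

{"messages": [
  {"role": "system", "content": "Return only JSON matching {\"order_id\": string, \"reason\": string}."},
  {"role": "user", "content": "I need to send back order 4821, it arrived damaged."},
  {"role": "assistant", "content": "{\"order_id\": \"4821\", \"reason\": \"damaged\"}"}
]}

A few dozen examples in this shape can teach the model to emit the schema reliably without repeating the full instructions in every prompt.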
Data Preparation
from dataclasses import dataclass
from typing import Optional
import json
import random


@dataclass
class TrainingExample:
    """A single training example."""
    system: Optional[str]
    user: str
    assistant: str

    def to_openai_format(self) -> dict:
        """Convert to OpenAI fine-tuning format."""
        messages = []
        if self.system:
            messages.append({"role": "system", "content": self.system})
        messages.append({"role": "user", "content": self.user})
        messages.append({"role": "assistant", "content": self.assistant})
        return {"messages": messages}

    def to_alpaca_format(self) -> dict:
        """Convert to Alpaca format for open-source fine-tuning."""
        return {
            "instruction": self.user,
            "input": "",
            "output": self.assistant
        }


class DatasetBuilder:
    """Build fine-tuning datasets."""

    def __init__(self):
        self.examples: list[TrainingExample] = []

    def add_example(
        self,
        user: str,
        assistant: str,
        system: Optional[str] = None
    ):
        """Add a training example."""
        self.examples.append(TrainingExample(
            system=system,
            user=user,
            assistant=assistant
        ))

    def add_from_conversations(self, conversations: list[list[dict]]):
        """Add examples from conversation logs."""
        for conversation in conversations:
            system = None
            user_msg = None
            for message in conversation:
                role = message.get("role")
                content = message.get("content")
                if role == "system":
                    system = content
                elif role == "user":
                    user_msg = content
                elif role == "assistant" and user_msg:
                    self.add_example(user_msg, content, system)
                    user_msg = None

    def validate(self) -> list[str]:
        """Validate dataset and return issues."""
        issues = []
        if len(self.examples) < 10:
            issues.append(f"Too few examples: {len(self.examples)} (minimum 10)")

        # Check for duplicates
        seen = set()
        duplicates = 0
        for ex in self.examples:
            key = (ex.user, ex.assistant)
            if key in seen:
                duplicates += 1
            seen.add(key)
        if duplicates > 0:
            issues.append(f"Found {duplicates} duplicate examples")

        # Check token lengths
        for i, ex in enumerate(self.examples):
            total_len = len(ex.user) + len(ex.assistant)
            if ex.system:
                total_len += len(ex.system)
            # Rough token estimate (4 chars per token)
            estimated_tokens = total_len // 4
            if estimated_tokens > 4096:
                issues.append(f"Example {i} may exceed token limit: ~{estimated_tokens} tokens")
        return issues

    def split(
        self,
        train_ratio: float = 0.9
    ) -> tuple[list[TrainingExample], list[TrainingExample]]:
        """Split into train and validation sets."""
        examples = self.examples.copy()
        random.shuffle(examples)
        split_idx = int(len(examples) * train_ratio)
        return examples[:split_idx], examples[split_idx:]

    def export_openai(self, filepath: str):
        """Export to OpenAI JSONL format."""
        with open(filepath, 'w') as f:
            for example in self.examples:
                f.write(json.dumps(example.to_openai_format()) + '\n')

    def export_alpaca(self, filepath: str):
        """Export to Alpaca JSON format."""
        data = [ex.to_alpaca_format() for ex in self.examples]
        with open(filepath, 'w') as f:
            json.dump(data, f, indent=2)


class DataAugmenter:
    """Augment training data."""

    def __init__(self, client=None, model: str = "gpt-4o-mini"):
        self.client = client
        self.model = model

    async def paraphrase(self, text: str) -> str:
        """Generate paraphrase of text."""
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": f"Paraphrase this text while keeping the same meaning:\n\n{text}"
            }]
        )
        return response.choices[0].message.content

    async def augment_example(
        self,
        example: TrainingExample,
        num_variations: int = 2
    ) -> list[TrainingExample]:
        """Generate variations of an example."""
        variations = [example]
        for _ in range(num_variations):
            paraphrased_user = await self.paraphrase(example.user)
            variations.append(TrainingExample(
                system=example.system,
                user=paraphrased_user,
                assistant=example.assistant
            ))
        return variations
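A minimal usage sketch for these pieces, assuming logs already holds conversation histories as lists of {"role": ..., "content": ...} dicts; the variable and file names are placeholders:

# Sketch: `logs` is assumed to be a list of conversations in the
# {"role": ..., "content": ...} format handled by add_from_conversations.
builder = DatasetBuilder()
builder.add_from_conversations(logs)

issues = builder.validate()
if issues:
    print("Fix these before training:", issues)
else:
    train_examples, val_examples = builder.split(train_ratio=0.9)
    # export_openai writes everything in the builder; write the train and
    # validation lists to separate files if you want a validation file
    # for the fine-tuning job.
    builder.export_openai("dataset.jsonl")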
OpenAI Fine-tuning
from dataclasses import dataclass
from typing import Optional
import asyncio


@dataclass
class FineTuneJob:
    """Fine-tuning job status."""
    job_id: str
    status: str
    model: str
    fine_tuned_model: Optional[str] = None
    trained_tokens: int = 0
    error: Optional[str] = None


class OpenAIFineTuner:
    """Fine-tune OpenAI models."""

    def __init__(self, client):
        self.client = client

    async def upload_file(self, filepath: str) -> str:
        """Upload training file."""
        with open(filepath, 'rb') as f:
            response = await self.client.files.create(
                file=f,
                purpose="fine-tune"
            )
        return response.id

    async def create_job(
        self,
        training_file_id: str,
        model: str = "gpt-4o-mini-2024-07-18",
        validation_file_id: Optional[str] = None,
        hyperparameters: Optional[dict] = None,
        suffix: Optional[str] = None
    ) -> FineTuneJob:
        """Create a fine-tuning job."""
        params = {
            "training_file": training_file_id,
            "model": model
        }
        if validation_file_id:
            params["validation_file"] = validation_file_id
        if hyperparameters:
            params["hyperparameters"] = hyperparameters
        if suffix:
            params["suffix"] = suffix
        response = await self.client.fine_tuning.jobs.create(**params)
        return FineTuneJob(
            job_id=response.id,
            status=response.status,
            model=model
        )

    async def get_job_status(self, job_id: str) -> FineTuneJob:
        """Get status of a fine-tuning job."""
        response = await self.client.fine_tuning.jobs.retrieve(job_id)
        return FineTuneJob(
            job_id=response.id,
            status=response.status,
            model=response.model,
            fine_tuned_model=response.fine_tuned_model,
            trained_tokens=response.trained_tokens or 0,
            error=response.error.message if response.error else None
        )

    async def wait_for_completion(
        self,
        job_id: str,
        poll_interval: int = 60
    ) -> FineTuneJob:
        """Wait for job to complete."""
        while True:
            job = await self.get_job_status(job_id)
            if job.status in ["succeeded", "failed", "cancelled"]:
                return job
            await asyncio.sleep(poll_interval)

    async def list_jobs(self, limit: int = 10) -> list[FineTuneJob]:
        """List recent fine-tuning jobs."""
        response = await self.client.fine_tuning.jobs.list(limit=limit)
        return [
            FineTuneJob(
                job_id=job.id,
                status=job.status,
                model=job.model,
                fine_tuned_model=job.fine_tuned_model
            )
            for job in response.data
        ]

    async def cancel_job(self, job_id: str) -> FineTuneJob:
        """Cancel a running job."""
        response = await self.client.fine_tuning.jobs.cancel(job_id)
        return FineTuneJob(
            job_id=response.id,
            status=response.status,
            model=response.model
        )
# Usage example
async def fine_tune_openai():
    """Complete fine-tuning workflow."""
    from openai import AsyncOpenAI

    client = AsyncOpenAI()
    fine_tuner = OpenAIFineTuner(client)

    # Build dataset
    builder = DatasetBuilder()
    builder.add_example(
        system="You are a helpful customer service agent.",
        user="I want to return my order",
        assistant="I'd be happy to help you with your return. Could you please provide your order number?"
    )
    # Add more examples...

    # Validate
    issues = builder.validate()
    if issues:
        print("Dataset issues:", issues)
        return

    # Export
    builder.export_openai("training_data.jsonl")

    # Upload
    file_id = await fine_tuner.upload_file("training_data.jsonl")
    print(f"Uploaded file: {file_id}")

    # Create job
    job = await fine_tuner.create_job(
        training_file_id=file_id,
        model="gpt-4o-mini-2024-07-18",
        suffix="customer-service"
    )
    print(f"Created job: {job.job_id}")

    # Wait for completion
    result = await fine_tuner.wait_for_completion(job.job_id)
    if result.status == "succeeded":
        print(f"Fine-tuned model: {result.fine_tuned_model}")
    else:
        print(f"Job failed: {result.error}")
Open-Source Fine-tuning
from dataclasses import dataclass
from typing import Optional
import torch


@dataclass
class LoRAConfig:
    """Configuration for LoRA fine-tuning."""
    r: int = 8  # Rank
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    target_modules: Optional[list[str]] = None

    def __post_init__(self):
        if self.target_modules is None:
            self.target_modules = ["q_proj", "v_proj"]


@dataclass
class TrainingConfig:
    """Training configuration."""
    output_dir: str = "./fine_tuned_model"
    num_epochs: int = 3
    batch_size: int = 4
    learning_rate: float = 2e-4
    warmup_steps: int = 100
    max_length: int = 512
    gradient_accumulation_steps: int = 4
    fp16: bool = True
    logging_steps: int = 10
    save_steps: int = 100


class HuggingFaceFineTuner:
    """Fine-tune models using Hugging Face + PEFT."""

    def __init__(
        self,
        model_name: str,
        lora_config: Optional[LoRAConfig] = None,
        training_config: Optional[TrainingConfig] = None
    ):
        self.model_name = model_name
        self.lora_config = lora_config or LoRAConfig()
        self.training_config = training_config or TrainingConfig()
        self.model = None
        self.tokenizer = None

    def load_model(self):
        """Load base model with LoRA."""
        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
        from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.tokenizer.pad_token = self.tokenizer.eos_token

        # Load model in 4-bit for efficiency
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            quantization_config=bnb_config,
            device_map="auto"
        )

        # Prepare for training
        self.model = prepare_model_for_kbit_training(self.model)

        # Add LoRA adapters
        lora_config = LoraConfig(
            r=self.lora_config.r,
            lora_alpha=self.lora_config.lora_alpha,
            lora_dropout=self.lora_config.lora_dropout,
            target_modules=self.lora_config.target_modules,
            bias="none",
            task_type="CAUSAL_LM"
        )
        self.model = get_peft_model(self.model, lora_config)

        # Print trainable parameters
        trainable = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        total = sum(p.numel() for p in self.model.parameters())
        print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

    def prepare_dataset(self, examples: list[TrainingExample]):
        """Prepare dataset for training."""
        from datasets import Dataset

        def format_example(example):
            """Format example as prompt-completion pair."""
            prompt = f"### Instruction:\n{example['user']}\n\n### Response:\n"
            completion = example['assistant']
            return {"text": prompt + completion}

        # Convert to dict format
        data = [
            {"user": ex.user, "assistant": ex.assistant}
            for ex in examples
        ]
        dataset = Dataset.from_list(data)
        dataset = dataset.map(format_example)
        return dataset

    def train(self, dataset):
        """Run training."""
        from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

        # Tokenize dataset
        def tokenize(examples):
            return self.tokenizer(
                examples["text"],
                truncation=True,
                max_length=self.training_config.max_length,
                padding="max_length"
            )

        tokenized_dataset = dataset.map(tokenize, batched=True)

        # Training arguments
        training_args = TrainingArguments(
            output_dir=self.training_config.output_dir,
            num_train_epochs=self.training_config.num_epochs,
            per_device_train_batch_size=self.training_config.batch_size,
            gradient_accumulation_steps=self.training_config.gradient_accumulation_steps,
            learning_rate=self.training_config.learning_rate,
            warmup_steps=self.training_config.warmup_steps,
            fp16=self.training_config.fp16,
            logging_steps=self.training_config.logging_steps,
            save_steps=self.training_config.save_steps,
            save_total_limit=2
        )

        # Data collator for causal LM (labels mirror the input ids)
        data_collator = DataCollatorForLanguageModeling(
            tokenizer=self.tokenizer,
            mlm=False
        )

        # Trainer
        trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=tokenized_dataset,
            data_collator=data_collator
        )

        # Train
        trainer.train()

        # Save adapter weights and tokenizer
        self.model.save_pretrained(self.training_config.output_dir)
        self.tokenizer.save_pretrained(self.training_config.output_dir)

    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        """Generate with fine-tuned model."""
        formatted_prompt = f"### Instruction:\n{prompt}\n\n### Response:\n"
        inputs = self.tokenizer(formatted_prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id
        )
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Extract just the response part
        if "### Response:" in response:
            response = response.split("### Response:")[-1].strip()
        return response
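Wiring the open-source path together might look roughly like the sketch below. The model id is just an example (any causal LM that fits your hardware works the same way), and builder is assumed to be the DatasetBuilder from the data-preparation section:

# Rough sketch; "meta-llama/Llama-3.1-8B" is an example model id and
# `builder` is the DatasetBuilder populated earlier.
tuner = HuggingFaceFineTuner(
    model_name="meta-llama/Llama-3.1-8B",
    lora_config=LoRAConfig(r=16, lora_alpha=32),
    training_config=TrainingConfig(num_epochs=2, output_dir="./lora-out"),
)
tuner.load_model()
dataset = tuner.prepare_dataset(builder.examples)
tuner.train(dataset)
print(tuner.generate("I want to return my order"))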
Evaluation
from dataclasses import dataclass
from typing import Any, Callable, Optional
import json


@dataclass
class EvalResult:
    """Evaluation result for a single example."""
    input: str
    expected: str
    actual: str
    score: float
    metrics: dict


class FineTuneEvaluator:
    """Evaluate fine-tuned models."""

    def __init__(
        self,
        base_client: Any,
        fine_tuned_client: Any,
        base_model: str,
        fine_tuned_model: str
    ):
        self.base_client = base_client
        self.fine_tuned_client = fine_tuned_client
        self.base_model = base_model
        self.fine_tuned_model = fine_tuned_model

    async def evaluate_example(
        self,
        example: TrainingExample,
        metrics: Optional[list[Callable]] = None
    ) -> tuple[EvalResult, EvalResult]:
        """Evaluate both models on an example."""
        # Get base model response
        base_response = await self.base_client.chat.completions.create(
            model=self.base_model,
            messages=[
                {"role": "system", "content": example.system or ""},
                {"role": "user", "content": example.user}
            ]
        )
        base_output = base_response.choices[0].message.content

        # Get fine-tuned model response
        ft_response = await self.fine_tuned_client.chat.completions.create(
            model=self.fine_tuned_model,
            messages=[
                {"role": "system", "content": example.system or ""},
                {"role": "user", "content": example.user}
            ]
        )
        ft_output = ft_response.choices[0].message.content

        # Calculate metrics
        base_metrics = self._calculate_metrics(example.assistant, base_output, metrics)
        ft_metrics = self._calculate_metrics(example.assistant, ft_output, metrics)

        base_result = EvalResult(
            input=example.user,
            expected=example.assistant,
            actual=base_output,
            score=base_metrics.get("overall", 0),
            metrics=base_metrics
        )
        ft_result = EvalResult(
            input=example.user,
            expected=example.assistant,
            actual=ft_output,
            score=ft_metrics.get("overall", 0),
            metrics=ft_metrics
        )
        return base_result, ft_result

    def _calculate_metrics(
        self,
        expected: str,
        actual: str,
        metrics: Optional[list[Callable]] = None
    ) -> dict:
        """Calculate evaluation metrics."""
        results = {}

        # Exact match
        results["exact_match"] = 1.0 if expected.strip() == actual.strip() else 0.0

        # Length ratio
        if len(expected) > 0:
            results["length_ratio"] = len(actual) / len(expected)
        else:
            results["length_ratio"] = 1.0

        # Word overlap
        expected_words = set(expected.lower().split())
        actual_words = set(actual.lower().split())
        if expected_words:
            overlap = len(expected_words & actual_words) / len(expected_words)
            results["word_overlap"] = overlap
        else:
            results["word_overlap"] = 1.0

        # Overall score (average of metrics)
        results["overall"] = sum(results.values()) / len(results)
        return results

    async def evaluate_dataset(
        self,
        examples: list[TrainingExample]
    ) -> dict:
        """Evaluate on full dataset."""
        base_scores = []
        ft_scores = []
        for example in examples:
            base_result, ft_result = await self.evaluate_example(example)
            base_scores.append(base_result.score)
            ft_scores.append(ft_result.score)
        return {
            "base_model": {
                "mean_score": sum(base_scores) / len(base_scores),
                "min_score": min(base_scores),
                "max_score": max(base_scores)
            },
            "fine_tuned_model": {
                "mean_score": sum(ft_scores) / len(ft_scores),
                "min_score": min(ft_scores),
                "max_score": max(ft_scores)
            },
            "improvement": (sum(ft_scores) - sum(base_scores)) / len(base_scores)
        }


class ABTestEvaluator:
    """A/B test base vs fine-tuned model."""

    def __init__(self, judge_client: Any, judge_model: str = "gpt-4o"):
        self.judge_client = judge_client
        self.judge_model = judge_model

    async def compare(
        self,
        prompt: str,
        response_a: str,
        response_b: str
    ) -> dict:
        """Have an LLM judge compare the two responses."""
        judge_prompt = f"""Compare these two responses to the same prompt.

Prompt: {prompt}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Consider accuracy, helpfulness, and clarity.
Respond with JSON: {{"winner": "A" or "B" or "tie", "reasoning": "..."}}"""
        response = await self.judge_client.chat.completions.create(
            model=self.judge_model,
            messages=[{"role": "user", "content": judge_prompt}],
            response_format={"type": "json_object"}
        )
        return json.loads(response.choices[0].message.content)
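Driving the evaluator could look like the sketch below, assuming client is an AsyncOpenAI instance, val_examples is the held-out split from DatasetBuilder.split, and the fine-tuned model id is the one returned by the completed job (the id shown is a placeholder):

# Sketch; run inside an async function (e.g. via asyncio.run).
evaluator = FineTuneEvaluator(
    base_client=client,
    fine_tuned_client=client,  # the same client can serve both models
    base_model="gpt-4o-mini-2024-07-18",
    fine_tuned_model="ft:gpt-4o-mini-2024-07-18:acme::abc123",  # placeholder id
)
report = await evaluator.evaluate_dataset(val_examples)
print(f"base: {report['base_model']['mean_score']:.3f}  "
      f"fine-tuned: {report['fine_tuned_model']['mean_score']:.3f}")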
Production Deployment
from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel
from typing import Optional
import asyncio

app = FastAPI()

# Storage for jobs and models
fine_tune_jobs: dict[str, FineTuneJob] = {}
deployed_models: dict[str, str] = {}  # alias -> model_id


class CreateJobRequest(BaseModel):
    training_file: str
    base_model: str = "gpt-4o-mini-2024-07-18"
    suffix: Optional[str] = None
    hyperparameters: Optional[dict] = None


class DeployRequest(BaseModel):
    model_id: str
    alias: str


class InferenceRequest(BaseModel):
    model_alias: str
    messages: list[dict]
    temperature: float = 0.7


fine_tuner = None  # Initialize with an actual client, e.g. OpenAIFineTuner(AsyncOpenAI())


@app.post("/v1/fine-tune/jobs")
async def create_fine_tune_job(
    request: CreateJobRequest,
    background_tasks: BackgroundTasks
):
    """Create a new fine-tuning job."""
    job = await fine_tuner.create_job(
        training_file_id=request.training_file,
        model=request.base_model,
        suffix=request.suffix,
        hyperparameters=request.hyperparameters
    )
    fine_tune_jobs[job.job_id] = job

    # Monitor job in background
    background_tasks.add_task(monitor_job, job.job_id)

    return {
        "job_id": job.job_id,
        "status": job.status,
        "model": job.model
    }


async def monitor_job(job_id: str):
    """Monitor job until completion."""
    while True:
        job = await fine_tuner.get_job_status(job_id)
        fine_tune_jobs[job_id] = job
        if job.status in ["succeeded", "failed", "cancelled"]:
            break
        await asyncio.sleep(60)


@app.get("/v1/fine-tune/jobs/{job_id}")
async def get_job_status(job_id: str):
    """Get status of a fine-tuning job."""
    if job_id not in fine_tune_jobs:
        # Try to fetch from API
        job = await fine_tuner.get_job_status(job_id)
        fine_tune_jobs[job_id] = job
    job = fine_tune_jobs[job_id]
    return {
        "job_id": job.job_id,
        "status": job.status,
        "model": job.model,
        "fine_tuned_model": job.fine_tuned_model,
        "trained_tokens": job.trained_tokens,
        "error": job.error
    }


@app.post("/v1/fine-tune/deploy")
async def deploy_model(request: DeployRequest):
    """Deploy a fine-tuned model with an alias."""
    deployed_models[request.alias] = request.model_id
    return {
        "alias": request.alias,
        "model_id": request.model_id,
        "status": "deployed"
    }


@app.post("/v1/fine-tune/inference")
async def run_inference(request: InferenceRequest):
    """Run inference on a deployed model."""
    if request.model_alias not in deployed_models:
        raise HTTPException(404, f"Model alias not found: {request.model_alias}")
    model_id = deployed_models[request.model_alias]

    # Use the fine-tuned model
    from openai import AsyncOpenAI
    client = AsyncOpenAI()
    response = await client.chat.completions.create(
        model=model_id,
        messages=request.messages,
        temperature=request.temperature
    )
    return {
        "model": model_id,
        "response": response.choices[0].message.content,
        "usage": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens
        }
    }


@app.get("/v1/fine-tune/models")
async def list_deployed_models():
    """List all deployed models."""
    return {"models": deployed_models}


@app.get("/health")
async def health():
    return {"status": "healthy"}
References
- OpenAI Fine-tuning Guide: https://platform.openai.com/docs/guides/fine-tuning
- Hugging Face PEFT: https://huggingface.co/docs/peft
- LoRA Paper: https://arxiv.org/abs/2106.09685
- QLoRA Paper: https://arxiv.org/abs/2305.14314
Conclusion
Fine-tuning is a powerful tool when used appropriately. Start by validating that fine-tuning is actually needed—many use cases work well with prompt engineering alone. When you do fine-tune, invest heavily in data quality. A small dataset of high-quality examples often outperforms a large dataset of mediocre ones. Use validation sets to detect overfitting and compare against the base model to quantify improvement. For OpenAI models, the API makes fine-tuning straightforward. For open-source models, LoRA and QLoRA enable fine-tuning on consumer hardware by training only a small number of adapter parameters. Always evaluate systematically—both with automated metrics and human judgment. Deploy fine-tuned models behind aliases so you can swap versions without changing client code. The goal is a model that consistently performs better than the base model on your specific task while maintaining general capabilities.