
LLM Fine-tuning Fundamentals: When, Why, and How to Customize Language Models

Introduction: Fine-tuning transforms a general-purpose LLM into a specialized model for your specific use case. While prompt engineering works for many applications, fine-tuning offers advantages when you need consistent formatting, domain-specific knowledge, or reduced latency from shorter prompts. This guide covers practical fine-tuning: when to fine-tune versus prompt engineer, preparing training data, running fine-tuning jobs with OpenAI and open-source models, evaluating results, and deploying fine-tuned models in production. Understanding these fundamentals helps you decide when fine-tuning is worth the investment and how to execute it effectively.

Fine-tuning Pipeline: Data Preparation, Training, Evaluation

When to Fine-tune

Fine-tuning makes sense in specific scenarios. Consider fine-tuning when you need consistent output formatting that’s hard to achieve with prompts alone—like always returning valid JSON in a specific schema. Fine-tuning helps when you have domain-specific terminology or writing style that the base model doesn’t capture well. It’s valuable when you want to reduce prompt length (and thus cost and latency) by baking instructions into the model weights. However, start with prompt engineering first. Many tasks that seem to need fine-tuning can be solved with better prompts, few-shot examples, or retrieval augmentation. Fine-tuning requires quality training data, costs money, and creates a model you need to maintain. Only fine-tune when you’ve exhausted prompt-based approaches and have clear evidence that fine-tuning will help.
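
As a concrete illustration, a single chat-format training example for the "always return valid JSON" case might look like the following. The schema and wording are made up, and the example is shown wrapped for readability; in the actual JSONL file each example occupies one line. The Data Preparation section below builds files in exactly this format programmatically.

{"messages": [
  {"role": "system", "content": "Extract the order details and reply with JSON only."},
  {"role": "user", "content": "I'd like to return order 48213, it arrived damaged."},
  {"role": "assistant", "content": "{\"order_id\": \"48213\", \"intent\": \"return\", \"reason\": \"damaged\"}"}
]}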

Data Preparation

from dataclasses import dataclass
from typing import Optional
import json
import random

@dataclass
class TrainingExample:
    """A single training example."""
    
    system: Optional[str]
    user: str
    assistant: str
    
    def to_openai_format(self) -> dict:
        """Convert to OpenAI fine-tuning format."""
        
        messages = []
        
        if self.system:
            messages.append({"role": "system", "content": self.system})
        
        messages.append({"role": "user", "content": self.user})
        messages.append({"role": "assistant", "content": self.assistant})
        
        return {"messages": messages}
    
    def to_alpaca_format(self) -> dict:
        """Convert to Alpaca format for open-source fine-tuning."""
        
        return {
            "instruction": self.user,
            "input": "",
            "output": self.assistant
        }

class DatasetBuilder:
    """Build fine-tuning datasets."""
    
    def __init__(self):
        self.examples: list[TrainingExample] = []
    
    def add_example(
        self,
        user: str,
        assistant: str,
        system: str = None
    ):
        """Add a training example."""
        
        self.examples.append(TrainingExample(
            system=system,
            user=user,
            assistant=assistant
        ))
    
    def add_from_conversations(self, conversations: list[list[dict]]):
        """Add examples from conversation logs."""
        
        for conversation in conversations:
            system = None
            user_msg = None
            
            for message in conversation:
                role = message.get("role")
                content = message.get("content")
                
                if role == "system":
                    system = content
                elif role == "user":
                    user_msg = content
                elif role == "assistant" and user_msg:
                    self.add_example(user_msg, content, system)
                    user_msg = None
    
    def validate(self) -> list[str]:
        """Validate dataset and return issues."""
        
        issues = []
        
        if len(self.examples) < 10:
            issues.append(f"Too few examples: {len(self.examples)} (minimum 10)")
        
        # Check for duplicates
        seen = set()
        duplicates = 0
        for ex in self.examples:
            key = (ex.user, ex.assistant)
            if key in seen:
                duplicates += 1
            seen.add(key)
        
        if duplicates > 0:
            issues.append(f"Found {duplicates} duplicate examples")
        
        # Check token lengths
        for i, ex in enumerate(self.examples):
            total_len = len(ex.user) + len(ex.assistant)
            if ex.system:
                total_len += len(ex.system)
            
            # Rough token estimate (4 chars per token)
            estimated_tokens = total_len // 4
            
            if estimated_tokens > 4096:
                issues.append(f"Example {i} may exceed token limit: ~{estimated_tokens} tokens")
        
        return issues
    
    def split(
        self,
        train_ratio: float = 0.9
    ) -> tuple[list[TrainingExample], list[TrainingExample]]:
        """Split into train and validation sets."""
        
        examples = self.examples.copy()
        random.shuffle(examples)
        
        split_idx = int(len(examples) * train_ratio)
        
        return examples[:split_idx], examples[split_idx:]
    
    def export_openai(self, filepath: str):
        """Export to OpenAI JSONL format."""
        
        with open(filepath, 'w') as f:
            for example in self.examples:
                f.write(json.dumps(example.to_openai_format()) + '\n')
    
    def export_alpaca(self, filepath: str):
        """Export to Alpaca JSON format."""
        
        data = [ex.to_alpaca_format() for ex in self.examples]
        
        with open(filepath, 'w') as f:
            json.dump(data, f, indent=2)

class DataAugmenter:
    """Augment training data."""
    
    def __init__(self, client=None, model: str = "gpt-4o-mini"):
        self.client = client
        self.model = model
    
    async def paraphrase(self, text: str) -> str:
        """Generate paraphrase of text."""
        
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": f"Paraphrase this text while keeping the same meaning:\n\n{text}"
            }]
        )
        
        return response.choices[0].message.content
    
    async def augment_example(
        self,
        example: TrainingExample,
        num_variations: int = 2
    ) -> list[TrainingExample]:
        """Generate variations of an example."""
        
        variations = [example]
        
        for _ in range(num_variations):
            paraphrased_user = await self.paraphrase(example.user)
            
            variations.append(TrainingExample(
                system=example.system,
                user=paraphrased_user,
                assistant=example.assistant
            ))
        
        return variations
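
A brief usage sketch of the builder above; the conversation shown is made up, and in practice you would load real logs and far more than one example before exporting.

# Illustrative DatasetBuilder usage; the conversation content is invented
builder = DatasetBuilder()
builder.add_from_conversations([
    [
        {"role": "system", "content": "You are a helpful customer service agent."},
        {"role": "user", "content": "Where is my order?"},
        {"role": "assistant", "content": "Happy to check. Could you share your order number?"}
    ]
])

print(builder.validate())   # e.g. ["Too few examples: 1 (minimum 10)"] until more data is added

train, val = builder.split(train_ratio=0.9)
builder.export_openai("training_data.jsonl")   # writes every example in the builder as JSONL
builder.export_alpaca("training_data.json")    # Alpaca-style JSON for open-source trainers

The DataAugmenter above can then paraphrase user turns to add variety when the raw dataset is small.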

OpenAI Fine-tuning

from dataclasses import dataclass
from typing import Optional
import asyncio

@dataclass
class FineTuneJob:
    """Fine-tuning job status."""
    
    job_id: str
    status: str
    model: str
    fine_tuned_model: Optional[str] = None
    trained_tokens: int = 0
    error: Optional[str] = None

class OpenAIFineTuner:
    """Fine-tune OpenAI models."""
    
    def __init__(self, client):
        self.client = client
    
    async def upload_file(self, filepath: str) -> str:
        """Upload training file."""
        
        with open(filepath, 'rb') as f:
            response = await self.client.files.create(
                file=f,
                purpose="fine-tune"
            )
        
        return response.id
    
    async def create_job(
        self,
        training_file_id: str,
        model: str = "gpt-4o-mini-2024-07-18",
        validation_file_id: str = None,
        hyperparameters: dict = None,
        suffix: str = None
    ) -> FineTuneJob:
        """Create a fine-tuning job."""
        
        params = {
            "training_file": training_file_id,
            "model": model
        }
        
        if validation_file_id:
            params["validation_file"] = validation_file_id
        
        if hyperparameters:
            params["hyperparameters"] = hyperparameters
        
        if suffix:
            params["suffix"] = suffix
        
        response = await self.client.fine_tuning.jobs.create(**params)
        
        return FineTuneJob(
            job_id=response.id,
            status=response.status,
            model=model
        )
    
    async def get_job_status(self, job_id: str) -> FineTuneJob:
        """Get status of a fine-tuning job."""
        
        response = await self.client.fine_tuning.jobs.retrieve(job_id)
        
        return FineTuneJob(
            job_id=response.id,
            status=response.status,
            model=response.model,
            fine_tuned_model=response.fine_tuned_model,
            trained_tokens=response.trained_tokens or 0,
            error=response.error.message if response.error else None
        )
    
    async def wait_for_completion(
        self,
        job_id: str,
        poll_interval: int = 60
    ) -> FineTuneJob:
        """Wait for job to complete."""
        
        while True:
            job = await self.get_job_status(job_id)
            
            if job.status in ["succeeded", "failed", "cancelled"]:
                return job
            
            await asyncio.sleep(poll_interval)
    
    async def list_jobs(self, limit: int = 10) -> list[FineTuneJob]:
        """List recent fine-tuning jobs."""
        
        response = await self.client.fine_tuning.jobs.list(limit=limit)
        
        return [
            FineTuneJob(
                job_id=job.id,
                status=job.status,
                model=job.model,
                fine_tuned_model=job.fine_tuned_model
            )
            for job in response.data
        ]
    
    async def cancel_job(self, job_id: str) -> FineTuneJob:
        """Cancel a running job."""
        
        response = await self.client.fine_tuning.jobs.cancel(job_id)
        
        return FineTuneJob(
            job_id=response.id,
            status=response.status,
            model=response.model
        )

# Usage example
async def fine_tune_openai():
    """Complete fine-tuning workflow."""
    
    from openai import AsyncOpenAI
    
    client = AsyncOpenAI()
    fine_tuner = OpenAIFineTuner(client)
    
    # Build dataset
    builder = DatasetBuilder()
    builder.add_example(
        system="You are a helpful customer service agent.",
        user="I want to return my order",
        assistant="I'd be happy to help you with your return. Could you please provide your order number?"
    )
    # Add more examples...
    
    # Validate
    issues = builder.validate()
    if issues:
        print("Dataset issues:", issues)
        return
    
    # Export
    builder.export_openai("training_data.jsonl")
    
    # Upload
    file_id = await fine_tuner.upload_file("training_data.jsonl")
    print(f"Uploaded file: {file_id}")
    
    # Create job
    job = await fine_tuner.create_job(
        training_file_id=file_id,
        model="gpt-4o-mini-2024-07-18",
        suffix="customer-service"
    )
    print(f"Created job: {job.job_id}")
    
    # Wait for completion
    result = await fine_tuner.wait_for_completion(job.job_id)
    
    if result.status == "succeeded":
        print(f"Fine-tuned model: {result.fine_tuned_model}")
    else:
        print(f"Job failed: {result.error}")

Open-Source Fine-tuning

from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class LoRAConfig:
    """Configuration for LoRA fine-tuning."""
    
    r: int = 8  # Rank
    lora_alpha: int = 16
    lora_dropout: float = 0.05
    target_modules: list[str] = None
    
    def __post_init__(self):
        if self.target_modules is None:
            self.target_modules = ["q_proj", "v_proj"]

@dataclass
class TrainingConfig:
    """Training configuration."""
    
    output_dir: str = "./fine_tuned_model"
    num_epochs: int = 3
    batch_size: int = 4
    learning_rate: float = 2e-4
    warmup_steps: int = 100
    max_length: int = 512
    gradient_accumulation_steps: int = 4
    fp16: bool = True
    logging_steps: int = 10
    save_steps: int = 100

class HuggingFaceFineTuner:
    """Fine-tune models using Hugging Face + PEFT."""
    
    def __init__(
        self,
        model_name: str,
        lora_config: LoRAConfig = None,
        training_config: TrainingConfig = None
    ):
        self.model_name = model_name
        self.lora_config = lora_config or LoRAConfig()
        self.training_config = training_config or TrainingConfig()
        
        self.model = None
        self.tokenizer = None
    
    def load_model(self):
        """Load base model with LoRA."""
        
        from transformers import AutoModelForCausalLM, AutoTokenizer
        from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
        
        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        
        # Load model in 4-bit for efficiency
        from transformers import BitsAndBytesConfig
        
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16
        )
        
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            quantization_config=bnb_config,
            device_map="auto"
        )
        
        # Prepare for training
        self.model = prepare_model_for_kbit_training(self.model)
        
        # Add LoRA adapters
        lora_config = LoraConfig(
            r=self.lora_config.r,
            lora_alpha=self.lora_config.lora_alpha,
            lora_dropout=self.lora_config.lora_dropout,
            target_modules=self.lora_config.target_modules,
            bias="none",
            task_type="CAUSAL_LM"
        )
        
        self.model = get_peft_model(self.model, lora_config)
        
        # Print trainable parameters
        trainable = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        total = sum(p.numel() for p in self.model.parameters())
        print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
    
    def prepare_dataset(self, examples: list[TrainingExample]):
        """Prepare dataset for training."""
        
        from datasets import Dataset
        
        def format_example(example):
            """Format example as prompt-completion pair."""
            
            prompt = f"### Instruction:\n{example['user']}\n\n### Response:\n"
            completion = example['assistant']
            
            return {"text": prompt + completion}
        
        # Convert to dict format
        data = [
            {"user": ex.user, "assistant": ex.assistant}
            for ex in examples
        ]
        
        dataset = Dataset.from_list(data)
        dataset = dataset.map(format_example)
        
        return dataset
    
    def train(self, dataset):
        """Run training."""
        
        from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
        
        # Tokenize dataset
        def tokenize(examples):
            return self.tokenizer(
                examples["text"],
                truncation=True,
                max_length=self.training_config.max_length,
                padding="max_length"
            )
        
        tokenized_dataset = dataset.map(tokenize, batched=True)
        
        # Training arguments
        training_args = TrainingArguments(
            output_dir=self.training_config.output_dir,
            num_train_epochs=self.training_config.num_epochs,
            per_device_train_batch_size=self.training_config.batch_size,
            gradient_accumulation_steps=self.training_config.gradient_accumulation_steps,
            learning_rate=self.training_config.learning_rate,
            warmup_steps=self.training_config.warmup_steps,
            fp16=self.training_config.fp16,
            logging_steps=self.training_config.logging_steps,
            save_steps=self.training_config.save_steps,
            save_total_limit=2
        )
        
        # Data collator
        data_collator = DataCollatorForLanguageModeling(
            tokenizer=self.tokenizer,
            mlm=False
        )
        
        # Trainer
        trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=tokenized_dataset,
            data_collator=data_collator
        )
        
        # Train
        trainer.train()
        
        # Save
        self.model.save_pretrained(self.training_config.output_dir)
        self.tokenizer.save_pretrained(self.training_config.output_dir)
    
    def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
        """Generate with fine-tuned model."""
        
        formatted_prompt = f"### Instruction:\n{prompt}\n\n### Response:\n"
        
        inputs = self.tokenizer(formatted_prompt, return_tensors="pt").to(self.model.device)
        
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id
        )
        
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Extract just the response part
        if "### Response:" in response:
            response = response.split("### Response:")[-1].strip()
        
        return response
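
A usage sketch for the trainer above. The base model name is an assumption (any Hub causal LM whose attention layers expose q_proj and v_proj works with the default LoRA targets), and a GPU with enough memory for 4-bit loading is assumed.

# Illustrative end-to-end run; model choice and data are assumptions
tuner = HuggingFaceFineTuner(
    model_name="mistralai/Mistral-7B-v0.1",   # hypothetical base model
    lora_config=LoRAConfig(r=16, lora_alpha=32),
    training_config=TrainingConfig(num_epochs=1, batch_size=2)
)
tuner.load_model()

examples = [
    TrainingExample(
        system=None,
        user="Summarize: The meeting moved to Thursday at 3pm.",
        assistant="Meeting rescheduled to Thursday, 3pm."
    ),
    # ... many more examples in practice
]
dataset = tuner.prepare_dataset(examples)
tuner.train(dataset)

print(tuner.generate("Summarize: The invoice is due on the 15th."))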

Evaluation

from dataclasses import dataclass
from typing import Any, Callable
import asyncio

@dataclass
class EvalResult:
    """Evaluation result for a single example."""
    
    input: str
    expected: str
    actual: str
    score: float
    metrics: dict

class FineTuneEvaluator:
    """Evaluate fine-tuned models."""
    
    def __init__(
        self,
        base_client: Any,
        fine_tuned_client: Any,
        base_model: str,
        fine_tuned_model: str
    ):
        self.base_client = base_client
        self.fine_tuned_client = fine_tuned_client
        self.base_model = base_model
        self.fine_tuned_model = fine_tuned_model
    
    async def evaluate_example(
        self,
        example: TrainingExample,
        metrics: list[Callable] = None
    ) -> tuple[EvalResult, EvalResult]:
        """Evaluate both models on an example."""
        
        # Build the message list once; include a system message only when one exists
        messages = []
        if example.system:
            messages.append({"role": "system", "content": example.system})
        messages.append({"role": "user", "content": example.user})
        
        # Get base model response
        base_response = await self.base_client.chat.completions.create(
            model=self.base_model,
            messages=messages
        )
        base_output = base_response.choices[0].message.content
        
        # Get fine-tuned model response
        ft_response = await self.fine_tuned_client.chat.completions.create(
            model=self.fine_tuned_model,
            messages=messages
        )
        ft_output = ft_response.choices[0].message.content
        
        # Calculate metrics
        base_metrics = self._calculate_metrics(example.assistant, base_output, metrics)
        ft_metrics = self._calculate_metrics(example.assistant, ft_output, metrics)
        
        base_result = EvalResult(
            input=example.user,
            expected=example.assistant,
            actual=base_output,
            score=base_metrics.get("overall", 0),
            metrics=base_metrics
        )
        
        ft_result = EvalResult(
            input=example.user,
            expected=example.assistant,
            actual=ft_output,
            score=ft_metrics.get("overall", 0),
            metrics=ft_metrics
        )
        
        return base_result, ft_result
    
    def _calculate_metrics(
        self,
        expected: str,
        actual: str,
        metrics: list[Callable] = None
    ) -> dict:
        """Calculate evaluation metrics."""
        
        results = {}
        
        # Exact match
        results["exact_match"] = 1.0 if expected.strip() == actual.strip() else 0.0
        
        # Length ratio
        if len(expected) > 0:
            results["length_ratio"] = len(actual) / len(expected)
        else:
            results["length_ratio"] = 1.0
        
        # Word overlap
        expected_words = set(expected.lower().split())
        actual_words = set(actual.lower().split())
        
        if expected_words:
            overlap = len(expected_words & actual_words) / len(expected_words)
            results["word_overlap"] = overlap
        else:
            results["word_overlap"] = 1.0
        
        # Apply any custom metric callables (each takes expected and actual, returns a float)
        if metrics:
            for metric in metrics:
                results[metric.__name__] = metric(expected, actual)
        
        # Overall score (average of all metrics)
        results["overall"] = sum(results.values()) / len(results)
        
        return results
    
    async def evaluate_dataset(
        self,
        examples: list[TrainingExample]
    ) -> dict:
        """Evaluate on full dataset."""
        
        base_scores = []
        ft_scores = []
        
        for example in examples:
            base_result, ft_result = await self.evaluate_example(example)
            base_scores.append(base_result.score)
            ft_scores.append(ft_result.score)
        
        return {
            "base_model": {
                "mean_score": sum(base_scores) / len(base_scores),
                "min_score": min(base_scores),
                "max_score": max(base_scores)
            },
            "fine_tuned_model": {
                "mean_score": sum(ft_scores) / len(ft_scores),
                "min_score": min(ft_scores),
                "max_score": max(ft_scores)
            },
            "improvement": (sum(ft_scores) - sum(base_scores)) / len(base_scores)
        }

class ABTestEvaluator:
    """A/B test base vs fine-tuned model."""
    
    def __init__(self, judge_client: Any, judge_model: str = "gpt-4o"):
        self.judge_client = judge_client
        self.judge_model = judge_model
    
    async def compare(
        self,
        prompt: str,
        response_a: str,
        response_b: str
    ) -> dict:
        """Have LLM judge compare responses."""
        
        judge_prompt = f"""Compare these two responses to the same prompt.

Prompt: {prompt}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Consider accuracy, helpfulness, and clarity.
Respond with JSON: {{"winner": "A" or "B" or "tie", "reasoning": "..."}}"""
        
        response = await self.judge_client.chat.completions.create(
            model=self.judge_model,
            messages=[{"role": "user", "content": judge_prompt}],
            response_format={"type": "json_object"}
        )
        
        import json
        return json.loads(response.choices[0].message.content)
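
A sketch of wiring the two evaluators together on a held-out validation split; the model names are placeholders for your own base and fine-tuned identifiers.

# Illustrative evaluation run; model names are placeholders
async def run_evaluation(client, val_examples: list[TrainingExample]):
    evaluator = FineTuneEvaluator(
        base_client=client,
        fine_tuned_client=client,
        base_model="gpt-4o-mini-2024-07-18",
        fine_tuned_model="ft:gpt-4o-mini-2024-07-18:acme:customer-service:abc123"  # placeholder
    )
    summary = await evaluator.evaluate_dataset(val_examples)
    print(summary)  # mean/min/max scores per model plus mean improvement

    # Optional head-to-head judgment on a single example
    judge = ABTestEvaluator(judge_client=client)
    base_result, ft_result = await evaluator.evaluate_example(val_examples[0])
    verdict = await judge.compare(
        prompt=val_examples[0].user,
        response_a=base_result.actual,
        response_b=ft_result.actual
    )
    print(verdict)  # {"winner": "A" | "B" | "tie", "reasoning": "..."}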

Production Deployment

from fastapi import FastAPI, HTTPException, BackgroundTasks
from pydantic import BaseModel
from typing import Optional
import asyncio

app = FastAPI()

# Storage for jobs and models
fine_tune_jobs: dict[str, FineTuneJob] = {}
deployed_models: dict[str, str] = {}  # alias -> model_id

class CreateJobRequest(BaseModel):
    training_file: str
    base_model: str = "gpt-4o-mini-2024-07-18"
    suffix: Optional[str] = None
    hyperparameters: Optional[dict] = None

class DeployRequest(BaseModel):
    model_id: str
    alias: str

class InferenceRequest(BaseModel):
    model_alias: str
    messages: list[dict]
    temperature: float = 0.7

# Initialize the fine-tuner with a real client at startup
from openai import AsyncOpenAI
fine_tuner = OpenAIFineTuner(AsyncOpenAI())

@app.post("/v1/fine-tune/jobs")
async def create_fine_tune_job(
    request: CreateJobRequest,
    background_tasks: BackgroundTasks
):
    """Create a new fine-tuning job."""
    
    job = await fine_tuner.create_job(
        training_file_id=request.training_file,
        model=request.base_model,
        suffix=request.suffix,
        hyperparameters=request.hyperparameters
    )
    
    fine_tune_jobs[job.job_id] = job
    
    # Monitor job in background
    background_tasks.add_task(monitor_job, job.job_id)
    
    return {
        "job_id": job.job_id,
        "status": job.status,
        "model": job.model
    }

async def monitor_job(job_id: str):
    """Monitor job until completion."""
    
    while True:
        job = await fine_tuner.get_job_status(job_id)
        fine_tune_jobs[job_id] = job
        
        if job.status in ["succeeded", "failed", "cancelled"]:
            break
        
        await asyncio.sleep(60)

@app.get("/v1/fine-tune/jobs/{job_id}")
async def get_job_status(job_id: str):
    """Get status of a fine-tuning job."""
    
    if job_id not in fine_tune_jobs:
        # Try to fetch from API
        job = await fine_tuner.get_job_status(job_id)
        fine_tune_jobs[job_id] = job
    
    job = fine_tune_jobs[job_id]
    
    return {
        "job_id": job.job_id,
        "status": job.status,
        "model": job.model,
        "fine_tuned_model": job.fine_tuned_model,
        "trained_tokens": job.trained_tokens,
        "error": job.error
    }

@app.post("/v1/fine-tune/deploy")
async def deploy_model(request: DeployRequest):
    """Deploy a fine-tuned model with an alias."""
    
    deployed_models[request.alias] = request.model_id
    
    return {
        "alias": request.alias,
        "model_id": request.model_id,
        "status": "deployed"
    }

@app.post("/v1/fine-tune/inference")
async def run_inference(request: InferenceRequest):
    """Run inference on a deployed model."""
    
    if request.model_alias not in deployed_models:
        raise HTTPException(404, f"Model alias not found: {request.model_alias}")
    
    model_id = deployed_models[request.model_alias]
    
    # Call the fine-tuned model through the shared client
    response = await fine_tuner.client.chat.completions.create(
        model=model_id,
        messages=request.messages,
        temperature=request.temperature
    )
    
    return {
        "model": model_id,
        "response": response.choices[0].message.content,
        "usage": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens
        }
    }

@app.get("/v1/fine-tune/models")
async def list_deployed_models():
    """List all deployed models."""
    
    return {"models": deployed_models}

@app.get("/health")
async def health():
    return {"status": "healthy"}

Conclusion

Fine-tuning is a powerful tool when used appropriately. Start by validating that fine-tuning is actually needed—many use cases work well with prompt engineering alone. When you do fine-tune, invest heavily in data quality. A small dataset of high-quality examples often outperforms a large dataset of mediocre ones. Use validation sets to detect overfitting and compare against the base model to quantify improvement. For OpenAI models, the API makes fine-tuning straightforward. For open-source models, LoRA and QLoRA enable fine-tuning on consumer hardware by training only a small number of adapter parameters. Always evaluate systematically—both with automated metrics and human judgment. Deploy fine-tuned models behind aliases so you can swap versions without changing client code. The goal is a model that consistently performs better than the base model on your specific task while maintaining general capabilities.