Problem: AI Generates Code That Doesn't Match Your Style
You use AI assistants for code generation, but they output generic patterns that clash with your team's established conventions. The generated code requires heavy editing to match your architecture, naming standards, and design patterns.
You'll learn:
- How to prepare your codebase for fine-tuning
- Which models support fine-tuning and their trade-offs
- How to validate the fine-tuned model produces your style
- Cost analysis and when fine-tuning makes sense
Time: 45 min | Level: Advanced
Why This Happens
Foundation models are trained on public code from GitHub, Stack Overflow, and open-source projects. They learn common patterns but have no exposure to your proprietary conventions like custom error handling, internal framework usage, or company-specific architectural decisions.
Common symptoms:
- AI uses `camelCase` when your team mandates `snake_case`
- Generated code imports public libraries instead of your internal packages
- Architecture doesn't follow your layer separation (services, repositories, controllers)
- Missing required logging, error handling, or security patterns
Solution
Step 1: Assess If Fine-Tuning Is Worth It
Fine-tuning makes sense when you have:
- Consistent codebase: 50,000+ lines of code following clear patterns
- High usage: Team generates 100+ code snippets per week with AI
- Specialized domain: Custom frameworks, internal DSLs, or unique architecture
- ROI justification: Time saved on editing > fine-tuning cost
Don't fine-tune if:
- Your codebase is inconsistent or rapidly changing
- You can achieve 80% accuracy with good prompting
- Usage is low (< 20 generations per week)
- You're using open-source patterns that models already know
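The ROI bullet above can be made concrete with a back-of-the-envelope calculation. This is only a sketch: the snippet volume, minutes saved, hourly rate, and training cost are illustrative assumptions you should replace with your own numbers.

```python
def fine_tuning_roi(snippets_per_week, minutes_saved_per_snippet,
                    hourly_rate, training_cost):
    """Rough monthly savings and break-even time; every input is an assumption."""
    # ~4 working weeks per month; minutes -> hours -> dollars
    monthly_savings = snippets_per_week * 4 * (minutes_saved_per_snippet / 60) * hourly_rate
    breakeven_months = training_cost / monthly_savings if monthly_savings > 0 else float("inf")
    return monthly_savings, breakeven_months

# Illustrative numbers only: 100 snippets/week, 3 minutes saved each,
# $80/hour developer cost, $30 training run
savings, months = fine_tuning_roi(100, 3, 80, 30)
print(f"${savings:.0f}/month saved, break-even in {months:.2f} months")
```

If break-even comes out at more than a few months, better prompting is probably the cheaper path.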
Step 2: Choose Your Model and Platform
Options as of February 2026:
| Model | Provider | Min Training Data | Cost | Inference Speed |
|---|---|---|---|---|
| GPT-4o mini | OpenAI | 10 examples | $0.30/1M tokens training | Fast |
| GPT-4o | OpenAI | 50 examples | $2.00/1M tokens training | Medium |
| Claude 3.5 Sonnet | Anthropic | 100 examples | Contact sales | Fast |
| Llama 3.1 70B | Self-hosted | 500+ examples | Infrastructure cost | Variable |
For this guide, we'll use OpenAI's GPT-4o mini - it offers the best cost/performance for most teams and has straightforward API integration.
Step 3: Extract Training Examples
Create a dataset of input/output pairs that represent ideal code generation:
# extract_training_data.py
import json
from pathlib import Path

def extract_functions(file_path):
    """Extract functions with docstrings as training pairs."""
    with open(file_path, 'r') as f:
        content = f.read()

    # Simple extraction - improve with AST parsing for production
    pairs = []
    lines = content.split('\n')
    i = 0
    while i < len(lines):
        # Find a function definition immediately followed by a docstring
        if 'def ' in lines[i] and i + 1 < len(lines) and \
                ('"""' in lines[i + 1] or "'''" in lines[i + 1]):
            # Extract the docstring (becomes the prompt)
            doc_start = i + 1
            doc_end = doc_start + 1
            while doc_end < len(lines) and ('"""' not in lines[doc_end] and "'''" not in lines[doc_end]):
                doc_end += 1
            docstring = '\n'.join(lines[doc_start:doc_end + 1])

            # Extract the full function (becomes the completion)
            func_end = doc_end + 1
            indent_level = len(lines[i]) - len(lines[i].lstrip())
            while func_end < len(lines):
                if lines[func_end].strip() and not lines[func_end].startswith(' ' * indent_level + ' '):
                    break
                func_end += 1
            full_function = '\n'.join(lines[i:func_end])

            pairs.append({
                "messages": [
                    {"role": "system", "content": "You are a code assistant that writes code following our company's style guide."},
                    {"role": "user", "content": f"Write a function:\n{docstring}"},
                    {"role": "assistant", "content": full_function}
                ]
            })
            i = func_end  # Resume scanning at the first line after the function
        else:
            i += 1
    return pairs

# Scan your codebase
training_data = []
code_dir = Path("./your_codebase")
for file_path in code_dir.rglob("*.py"):
    if "test" not in str(file_path) and "vendor" not in str(file_path):
        training_data.extend(extract_functions(file_path))

# Save as JSONL (required format for OpenAI)
with open("training_data.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")

print(f"Extracted {len(training_data)} training examples")
Run it:
python extract_training_data.py
Expected: Extracted 150 training examples (adjust extraction logic if you get < 50)
If it fails:
- No docstrings found: Add docstrings to your key functions or extract from comments
- Functions too fragmented: Use AST parsing (`ast` module) instead of text search
- Need more examples: Include class methods, configuration patterns, API routes
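If the line-based extraction proves too brittle, here is a sketch of the AST approach using Python's standard `ast` module. The function name `extract_functions_ast` is ours, and `ast.get_source_segment` requires Python 3.8+:

```python
import ast

def extract_functions_ast(source):
    """Extract (docstring, full source) training pairs for documented functions."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring = ast.get_docstring(node)
            if docstring:
                # get_source_segment recovers the exact original text of the node
                full_function = ast.get_source_segment(source, node)
                pairs.append({
                    "messages": [
                        {"role": "system", "content": "You are a code assistant that writes code following our company's style guide."},
                        {"role": "user", "content": f"Write a function:\n{docstring}"},
                        {"role": "assistant", "content": full_function}
                    ]
                })
    return pairs
```

This handles nested functions, multi-line signatures, and single-line docstrings that trip up the text-based version.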
Step 4: Validate Your Training Data
Before uploading, ensure your data teaches the patterns you want:
# validate_training_data.py
import json
import re

def validate_example(example):
    """Check if a training example follows best practices."""
    messages = example.get("messages", [])
    if len(messages) != 3:
        return False, "Must have system, user, assistant messages"

    user_msg = messages[1]["content"]
    assistant_msg = messages[2]["content"]

    # Check for code in the completion
    if "def " not in assistant_msg and "class " not in assistant_msg:
        return False, "Assistant message must contain actual code"

    # Check length (OpenAI recommends roughly 200-2000 tokens per example;
    # character count is a rough proxy at ~4 characters per token)
    total_chars = len(user_msg) + len(assistant_msg)
    if total_chars < 100:
        return False, "Example too short - won't teach meaningful patterns"
    if total_chars > 8000:
        return False, "Example too long - consider splitting"

    # Check for proprietary patterns (customize these)
    if "your_internal_package" in assistant_msg:
        return True, "Contains internal imports"
    if re.search(r'log\.(info|error|debug)', assistant_msg):
        return True, "Uses company logging"
    return True, "Valid generic example"

# Validate the dataset
with open("training_data.jsonl", "r") as f:
    valid_count = 0
    invalid_examples = []
    for i, line in enumerate(f, start=1):
        example = json.loads(line)
        is_valid, reason = validate_example(example)
        if is_valid:
            valid_count += 1
        else:
            invalid_examples.append((i, reason))

print(f"Valid examples: {valid_count}")
print(f"Invalid examples: {len(invalid_examples)}")
if invalid_examples:
    print("\nFirst 5 issues:")
    for idx, reason in invalid_examples[:5]:
        print(f"  Line {idx}: {reason}")
Run validation:
python validate_training_data.py
You should see: At least 50 valid examples with proprietary patterns. If < 50, extract more code or lower your pattern threshold.
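If you have enough data, consider holding out a validation split before uploading; OpenAI's fine-tuning API accepts an optional `validation_file` alongside the training file, which lets you watch for overfitting during training. A minimal sketch, where the output file names and the 10% split are arbitrary choices:

```python
import json
import random

def split_dataset(path, val_fraction=0.1, seed=42):
    """Shuffle examples and write train.jsonl / validation.jsonl splits."""
    with open(path) as f:
        examples = [json.loads(line) for line in f if line.strip()]
    random.Random(seed).shuffle(examples)  # deterministic shuffle for reproducibility
    n_val = max(1, int(len(examples) * val_fraction))
    splits = {"validation.jsonl": examples[:n_val], "train.jsonl": examples[n_val:]}
    for name, items in splits.items():
        with open(name, "w") as out:
            for item in items:
                out.write(json.dumps(item) + "\n")
    return len(examples) - n_val, n_val
```

Upload `train.jsonl` as the training file and pass the validation file's ID as `validation_file` when creating the job.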
Step 5: Upload and Start Fine-Tuning
# fine_tune.py
import openai
import os
import time

# Set your API key
openai.api_key = os.getenv("OPENAI_API_KEY")

# Upload the training file
print("Uploading training data...")
with open("training_data.jsonl", "rb") as f:
    file_response = openai.files.create(
        file=f,
        purpose="fine-tune"
    )
file_id = file_response.id
print(f"File uploaded: {file_id}")

# Create the fine-tuning job
print("Starting fine-tuning job...")
job = openai.fine_tuning.jobs.create(
    training_file=file_id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,  # Start with 3, increase if underfitting
        "learning_rate_multiplier": 1.0
    },
    suffix="company-style-v1"  # Appears in the model name
)
job_id = job.id
print(f"Job created: {job_id}")

# Monitor progress
print("Monitoring job status (this takes 10-30 minutes)...")
while True:
    job_status = openai.fine_tuning.jobs.retrieve(job_id)
    status = job_status.status
    print(f"Status: {status}")
    if status == "succeeded":
        model_id = job_status.fine_tuned_model
        print("\n✓ Fine-tuning complete!")
        print(f"Model ID: {model_id}")
        break
    elif status in ["failed", "cancelled"]:
        print(f"\n✗ Job {status}")
        print(f"Error: {job_status.error}")
        break
    time.sleep(60)  # Check every minute
Run fine-tuning:
export OPENAI_API_KEY="sk-..."
python fine_tune.py
Expected: Job completes in 10-30 minutes. You'll get a model ID like ft:gpt-4o-mini-2024-07-18:company:company-style-v1:AbC123
If it fails:
- Invalid training data format: Ensure JSONL has one example per line, no trailing commas
- Insufficient credits: Add payment method in OpenAI dashboard
- Rate limit: Wait and retry, or use batch API for large datasets
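For the "invalid training data format" failure, a quick structural check can pinpoint the offending line before you re-upload. This is a sketch; it only verifies that each line is valid JSON and ends with an assistant message, so extend the role checks to your needs:

```python
import json

def check_jsonl_format(path):
    """Return a list of (line_number, problem) for malformed fine-tuning JSONL lines."""
    problems = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((line_no, "blank line"))
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((line_no, f"invalid JSON: {e.msg}"))
                continue
            roles = [m.get("role") for m in obj.get("messages", [])]
            # The completion the model learns from must be the assistant turn
            if roles[-1:] != ["assistant"]:
                problems.append((line_no, "last message must be from the assistant"))
    return problems
```

Run it against `training_data.jsonl` and fix or drop any reported lines.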
Step 6: Test Your Fine-Tuned Model
# test_model.py
import openai
import os

openai.api_key = os.getenv("OPENAI_API_KEY")

# Your fine-tuned model ID from step 5
FINE_TUNED_MODEL = "ft:gpt-4o-mini-2024-07-18:company:company-style-v1:AbC123"

# Test prompts that should surface your style
test_prompts = [
    "Create a user authentication service with error handling",
    "Write a function to validate email addresses",
    "Implement a REST API endpoint for creating orders"
]

for prompt in test_prompts:
    print(f"\n{'='*60}")
    print(f"PROMPT: {prompt}")
    print('='*60)

    # Get a response from the fine-tuned model
    response = openai.chat.completions.create(
        model=FINE_TUNED_MODEL,
        messages=[
            {"role": "system", "content": "You are a code assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3  # Lower = more consistent with training
    )
    code = response.choices[0].message.content
    print(code)

    # Check for your patterns
    checks = {
        "Uses internal imports": "your_internal_package" in code,
        "Follows naming convention": "def " in code and "_" in code,
        "Includes logging": any(x in code for x in ["log.", "logger.", "logging."]),
        "Has error handling": "try:" in code or "except" in code
    }
    print("\nPattern validation:")
    for check, passed in checks.items():
        print(f"  {'✓' if passed else '✗'} {check}")
Run tests:
python test_model.py
You should see: Code output matching your patterns (internal imports, naming conventions, error handling style). If < 70% of patterns appear, increase epochs to 5 and retrain.
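To turn the per-prompt checkmarks into the single pass-rate figure mentioned above, you can aggregate results across all prompts. A sketch, where `checks` maps pattern names to predicate functions and the sample outputs are invented for illustration:

```python
def pattern_pass_rate(outputs, checks):
    """Fraction of (output, check) pairs that pass across all test prompts."""
    results = [predicate(code) for code in outputs for predicate in checks.values()]
    return sum(results) / len(results) if results else 0.0

# Hypothetical checks and model outputs, just to show the shape of the data
checks = {
    "has error handling": lambda c: "try:" in c or "except" in c,
    "uses snake_case defs": lambda c: "def " in c and "_" in c,
}
outputs = [
    "def get_user():\n    try:\n        pass\n    except ValueError:\n        pass",
    "def getUser(): pass",
]
rate = pattern_pass_rate(outputs, checks)
print(f"{rate:.0%} of pattern checks passed")
```

Anything below 0.7 on your real outputs suggests retraining with more epochs or more examples.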
Step 7: Compare to Base Model
Run the same tests against the base model to quantify improvement:
# compare_models.py
import openai
import os
import re

openai.api_key = os.getenv("OPENAI_API_KEY")

def test_model(model_id, prompt):
    response = openai.chat.completions.create(
        model=model_id,
        messages=[
            {"role": "system", "content": "You are a code assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content

# Test both models
test_prompt = "Create a user service with database access and error handling"
base_output = test_model("gpt-4o-mini-2024-07-18", test_prompt)
finetuned_output = test_model("ft:gpt-4o-mini-2024-07-18:company:company-style-v1:AbC123", test_prompt)

print("BASE MODEL:")
print(base_output)
print("\n" + "="*60 + "\n")
print("FINE-TUNED MODEL:")
print(finetuned_output)

# Score both outputs
def score_patterns(code):
    score = 0
    if "your_internal_package" in code: score += 30
    if "log." in code or "logger." in code: score += 20
    if "try:" in code and "except" in code: score += 20
    # More snake_case identifiers than camelCase ones
    snake = len(re.findall(r'\b[a-z]+(?:_[a-z0-9]+)+\b', code))
    camel = len(re.findall(r'\b[a-z]+(?:[A-Z][a-z0-9]*)+\b', code))
    if snake > camel: score += 15
    if "# " in code: score += 15  # Has comments
    return score

base_score = score_patterns(base_output)
tuned_score = score_patterns(finetuned_output)

print(f"\nBase model score: {base_score}/100")
print(f"Fine-tuned score: {tuned_score}/100")
print(f"Improvement: {tuned_score - base_score} points")
Expected: Fine-tuned model scores 60-80+ points, base model scores 20-40 points. If improvement is < 20 points, you need more training examples or higher quality data.
Verification
Test in production workflow:
# Integrate into your IDE/CLI tool
response = openai.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:company:company-style-v1:AbC123",
    messages=[
        {"role": "system", "content": "You are a code assistant following our company style guide."},
        {"role": "user", "content": "Write a payment processing service"}
    ]
)
You should see: Generated code that requires minimal editing, uses your internal libraries, follows your error handling patterns, and matches your naming conventions.
Measure success (the figures below are illustrative targets; track your own numbers):
- Before fine-tuning: developers spend roughly 40% of their time editing AI-generated code
- After fine-tuning: editing time drops to 10-15%
- Break-even: if your team generates 200+ snippets/month, you should recoup training costs in 2-3 months
What You Learned
- Fine-tuning works best with 100+ consistent examples of your patterns
- GPT-4o mini costs ~$15-30 to train on 150 examples; Claude fine-tuning requires a sales contact
- Testing against base model quantifies if fine-tuning improved output
- ROI depends on usage volume - low usage teams should use better prompting instead
Limitations:
- Model still needs good prompts - fine-tuning isn't magic
- Patterns drift as your codebase evolves, retrain every 6 months
- Doesn't learn from feedback during generation (that requires RLHF)
Tested with OpenAI GPT-4o mini (Feb 2026), Python 3.11, openai==1.12.0