Problem: AI Generates Code That Doesn't Match Your Style
You use AI assistants for code generation, but they output generic patterns that clash with your team's established conventions. The generated code requires heavy editing to match your architecture, naming standards, and design patterns.
You'll learn:
- How to prepare your codebase for fine-tuning
- Which models support fine-tuning and their trade-offs
- How to validate the fine-tuned model produces your style
- Cost analysis and when fine-tuning makes sense
Time: 45 min | Level: Advanced
Why This Happens
Foundation models are trained on public code from GitHub, Stack Overflow, and open-source projects. They learn common patterns but have no exposure to your proprietary conventions like custom error handling, internal framework usage, or company-specific architectural decisions.
Common symptoms:
- AI uses `camelCase` when your team mandates `snake_case`
- Generated code imports public libraries instead of your internal packages
- Architecture doesn't follow your layer separation (services, repositories, controllers)
- Missing required logging, error handling, or security patterns
Solution
Step 1: Assess If Fine-Tuning Is Worth It
Fine-tuning makes sense when you have:
- Consistent codebase: 50,000+ lines of code following clear patterns
- High usage: Team generates 100+ code snippets per week with AI
- Specialized domain: Custom frameworks, internal DSLs, or unique architecture
- ROI justification: Time saved on editing > fine-tuning cost
Don't fine-tune if:
- Your codebase is inconsistent or rapidly changing
- You can achieve 80% accuracy with good prompting
- Usage is low (< 20 generations per week)
- You're using open-source patterns that models already know
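The ROI bullet above can be made concrete with a back-of-the-envelope calculation. This is only a sketch: the snippet volume, minutes saved, hourly rate, and training cost are illustrative assumptions you should replace with your own numbers.

```python
def fine_tuning_roi(snippets_per_week, minutes_saved_per_snippet,
                    hourly_rate, training_cost):
    """Rough monthly savings and break-even time; every input is an assumption."""
    # ~4 working weeks per month; minutes -> hours -> dollars
    monthly_savings = snippets_per_week * 4 * (minutes_saved_per_snippet / 60) * hourly_rate
    breakeven_months = training_cost / monthly_savings if monthly_savings > 0 else float("inf")
    return monthly_savings, breakeven_months

# Illustrative numbers only: 100 snippets/week, 3 minutes saved each,
# $80/hour developer cost, $30 training run
savings, months = fine_tuning_roi(100, 3, 80, 30)
print(f"${savings:.0f}/month saved, break-even in {months:.2f} months")
```

If break-even comes out at more than a few months, better prompting is probably the cheaper path.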
Step 2: Choose Your Model and Platform
Options as of February 2026:
| Model | Provider | Min Training Data | Cost | Inference Speed |
|---|---|---|---|---|
| GPT-4o mini | OpenAI | 10 examples | $0.30/1M tokens training | Fast |
| GPT-4o | OpenAI | 50 examples | $2.00/1M tokens training | Medium |
| Claude 3.5 Sonnet | Anthropic | 100 examples | Contact sales | Fast |
| Llama 3.1 70B | Self-hosted | 500+ examples | Infrastructure cost | Variable |
For this guide, we'll use OpenAI's GPT-4o mini - it offers the best cost/performance for most teams and has straightforward API integration.
Step 3: Extract Training Examples
Create a dataset of input/output pairs that represent ideal code generation:
# extract_training_data.py
import json
from pathlib import Path

def extract_functions(file_path):
    """Extract functions with docstrings as training pairs."""
    with open(file_path, 'r') as f:
        content = f.read()

    # Simple extraction - improve with AST parsing for production
    pairs = []
    lines = content.split('\n')
    i = 0
    while i < len(lines):
        # Find a function definition immediately followed by a docstring
        if 'def ' in lines[i] and i + 1 < len(lines) and \
                ('"""' in lines[i + 1] or "'''" in lines[i + 1]):
            # Extract the docstring (becomes the prompt)
            doc_start = i + 1
            doc_end = doc_start + 1
            while doc_end < len(lines) and ('"""' not in lines[doc_end] and "'''" not in lines[doc_end]):
                doc_end += 1
            docstring = '\n'.join(lines[doc_start:doc_end + 1])

            # Extract the full function (becomes the completion)
            func_end = doc_end + 1
            indent_level = len(lines[i]) - len(lines[i].lstrip())
            while func_end < len(lines):
                if lines[func_end].strip() and not lines[func_end].startswith(' ' * indent_level + ' '):
                    break
                func_end += 1
            full_function = '\n'.join(lines[i:func_end])

            pairs.append({
                "messages": [
                    {"role": "system", "content": "You are a code assistant that writes code following our company's style guide."},
                    {"role": "user", "content": f"Write a function:\n{docstring}"},
                    {"role": "assistant", "content": full_function}
                ]
            })
            i = func_end  # Resume scanning at the first line after the function
        else:
            i += 1
    return pairs

# Scan your codebase
training_data = []
code_dir = Path("./your_codebase")
for file_path in code_dir.rglob("*.py"):
    if "test" not in str(file_path) and "vendor" not in str(file_path):
        training_data.extend(extract_functions(file_path))

# Save as JSONL (required format for OpenAI)
with open("training_data.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")

print(f"Extracted {len(training_data)} training examples")
Run it:
python extract_training_data.py
Expected: Extracted 150 training examples (adjust extraction logic if you get < 50)
If it fails:
- No docstrings found: Add docstrings to your key functions or extract from comments
- Functions too fragmented: Use AST parsing (`ast` module) instead of text search
- Need more examples: Include class methods, configuration patterns, API routes
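If the line-based extraction proves too brittle, here is a sketch of the AST approach using Python's standard `ast` module. The function name `extract_functions_ast` is ours, and `ast.get_source_segment` requires Python 3.8+:

```python
import ast

def extract_functions_ast(source):
    """Extract (docstring, full source) training pairs for documented functions."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            docstring = ast.get_docstring(node)
            if docstring:
                # get_source_segment recovers the exact original text of the node
                full_function = ast.get_source_segment(source, node)
                pairs.append({
                    "messages": [
                        {"role": "system", "content": "You are a code assistant that writes code following our company's style guide."},
                        {"role": "user", "content": f"Write a function:\n{docstring}"},
                        {"role": "assistant", "content": full_function}
                    ]
                })
    return pairs
```

This handles nested functions, multi-line signatures, and single-line docstrings that trip up the text-based version.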
Step 4: Validate Your Training Data
Before uploading, ensure your data teaches the patterns you want:
# validate_training_data.py
import json
import re

def validate_example(example):
    """Check if a training example follows best practices."""
    messages = example.get("messages", [])
    if len(messages) != 3:
        return False, "Must have system, user, assistant messages"

    user_msg = messages[1]["content"]
    assistant_msg = messages[2]["content"]

    # Check for code in the completion
    if "def " not in assistant_msg and "class " not in assistant_msg:
        return False, "Assistant message must contain actual code"

    # Check length (OpenAI recommends roughly 200-2000 tokens per example;
    # character count is a rough proxy at ~4 characters per token)
    total_chars = len(user_msg) + len(assistant_msg)
    if total_chars < 100:
        return False, "Example too short - won't teach meaningful patterns"
    if total_chars > 8000:
        return False, "Example too long - consider splitting"

    # Check for proprietary patterns (customize these)
    if "your_internal_package" in assistant_msg:
        return True, "Contains internal imports"
    if re.search(r'log\.(info|error|debug)', assistant_msg):
        return True, "Uses company logging"
    return True, "Valid generic example"

# Validate the dataset
with open("training_data.jsonl", "r") as f:
    valid_count = 0
    invalid_examples = []
    for i, line in enumerate(f, start=1):
        example = json.loads(line)
        is_valid, reason = validate_example(example)
        if is_valid:
            valid_count += 1
        else:
            invalid_examples.append((i, reason))

print(f"Valid examples: {valid_count}")
print(f"Invalid examples: {len(invalid_examples)}")
if invalid_examples:
    print("\nFirst 5 issues:")
    for idx, reason in invalid_examples[:5]:
        print(f"  Line {idx}: {reason}")
Run validation:
python validate_training_data.py
You should see: At least 50 valid examples with proprietary patterns. If < 50, extract more code or lower your pattern threshold.
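If you have enough data, consider holding out a validation split before uploading; OpenAI's fine-tuning API accepts an optional `validation_file` alongside the training file, which lets you watch for overfitting during training. A minimal sketch, where the output file names and the 10% split are arbitrary choices:

```python
import json
import random

def split_dataset(path, val_fraction=0.1, seed=42):
    """Shuffle examples and write train.jsonl / validation.jsonl splits."""
    with open(path) as f:
        examples = [json.loads(line) for line in f if line.strip()]
    random.Random(seed).shuffle(examples)  # deterministic shuffle for reproducibility
    n_val = max(1, int(len(examples) * val_fraction))
    splits = {"validation.jsonl": examples[:n_val], "train.jsonl": examples[n_val:]}
    for name, items in splits.items():
        with open(name, "w") as out:
            for item in items:
                out.write(json.dumps(item) + "\n")
    return len(examples) - n_val, n_val
```

Upload `train.jsonl` as the training file and pass the validation file's ID as `validation_file` when creating the job.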
Step 5: Upload and Start Fine-Tuning
# fine_tune.py
import openai
import os
import time

# Set your API key
openai.api_key = os.getenv("OPENAI_API_KEY")

# Upload the training file
print("Uploading training data...")
with open("training_data.jsonl", "rb") as f:
    file_response = openai.files.create(
        file=f,
        purpose="fine-tune"
    )
file_id = file_response.id
print(f"File uploaded: {file_id}")

# Create the fine-tuning job
print("Starting fine-tuning job...")
job = openai.fine_tuning.jobs.create(
    training_file=file_id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,  # Start with 3, increase if underfitting
        "learning_rate_multiplier": 1.0
    },
    suffix="company-style-v1"  # Appears in the model name
)
job_id = job.id
print(f"Job created: {job_id}")

# Monitor progress
print("Monitoring job status (this takes 10-30 minutes)...")
while True:
    job_status = openai.fine_tuning.jobs.retrieve(job_id)
    status = job_status.status
    print(f"Status: {status}")
    if status == "succeeded":
        model_id = job_status.fine_tuned_model
        print("\n✓ Fine-tuning complete!")
        print(f"Model ID: {model_id}")
        break
    elif status in ["failed", "cancelled"]:
        print(f"\n✗ Job {status}")
        print(f"Error: {job_status.error}")
        break
    time.sleep(60)  # Check every minute
Run fine-tuning:
export OPENAI_API_KEY="sk-..."
python fine_tune.py
Expected: Job completes in 10-30 minutes. You'll get a model ID like ft:gpt-4o-mini-2024-07-18:company:company-style-v1:AbC123
If it fails:
- Invalid training data format: Ensure JSONL has one example per line, no trailing commas
- Insufficient credits: Add payment method in OpenAI dashboard
- Rate limit: Wait and retry, or use batch API for large datasets
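For the "invalid training data format" failure, a quick structural check can pinpoint the offending line before you re-upload. This is a sketch; it only verifies that each line is valid JSON and ends with an assistant message, so extend the role checks to your needs:

```python
import json

def check_jsonl_format(path):
    """Return a list of (line_number, problem) for malformed fine-tuning JSONL lines."""
    problems = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((line_no, "blank line"))
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((line_no, f"invalid JSON: {e.msg}"))
                continue
            roles = [m.get("role") for m in obj.get("messages", [])]
            # The completion the model learns from must be the assistant turn
            if roles[-1:] != ["assistant"]:
                problems.append((line_no, "last message must be from the assistant"))
    return problems
```

Run it against `training_data.jsonl` and fix or drop any reported lines.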
Step 6: Test Your Fine-Tuned Model
# test_model.py
import openai
import os

openai.api_key = os.getenv("OPENAI_API_KEY")

# Your fine-tuned model ID from step 5
FINE_TUNED_MODEL = "ft:gpt-4o-mini-2024-07-18:company:company-style-v1:AbC123"

# Test prompts that should surface your style
test_prompts = [
    "Create a user authentication service with error handling",
    "Write a function to validate email addresses",
    "Implement a REST API endpoint for creating orders"
]

for prompt in test_prompts:
    print(f"\n{'='*60}")
    print(f"PROMPT: {prompt}")
    print('='*60)

    # Get a response from the fine-tuned model
    response = openai.chat.completions.create(
        model=FINE_TUNED_MODEL,
        messages=[
            {"role": "system", "content": "You are a code assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3  # Lower = more consistent with training
    )
    code = response.choices[0].message.content
    print(code)

    # Check for your patterns
    checks = {
        "Uses internal imports": "your_internal_package" in code,
        "Follows naming convention": "def " in code and "_" in code,
        "Includes logging": any(x in code for x in ["log.", "logger.", "logging."]),
        "Has error handling": "try:" in code or "except" in code
    }
    print("\nPattern validation:")
    for check, passed in checks.items():
        print(f"  {'✓' if passed else '✗'} {check}")
Run tests:
python test_model.py
You should see: Code output matching your patterns (internal imports, naming conventions, error handling style). If < 70% of patterns appear, increase epochs to 5 and retrain.
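To turn the per-prompt checkmarks into the single pass-rate figure mentioned above, you can aggregate results across all prompts. A sketch, where `checks` maps pattern names to predicate functions and the sample outputs are invented for illustration:

```python
def pattern_pass_rate(outputs, checks):
    """Fraction of (output, check) pairs that pass across all test prompts."""
    results = [predicate(code) for code in outputs for predicate in checks.values()]
    return sum(results) / len(results) if results else 0.0

# Hypothetical checks and model outputs, just to show the shape of the data
checks = {
    "has error handling": lambda c: "try:" in c or "except" in c,
    "uses snake_case defs": lambda c: "def " in c and "_" in c,
}
outputs = [
    "def get_user():\n    try:\n        pass\n    except ValueError:\n        pass",
    "def getUser(): pass",
]
rate = pattern_pass_rate(outputs, checks)
print(f"{rate:.0%} of pattern checks passed")
```

Anything below 0.7 on your real outputs suggests retraining with more epochs or more examples.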
Step 7: Compare to Base Model
Run the same tests against the base model to quantify improvement:
# compare_models.py
import openai
import os
import re

openai.api_key = os.getenv("OPENAI_API_KEY")

def test_model(model_id, prompt):
    response = openai.chat.completions.create(
        model=model_id,
        messages=[
            {"role": "system", "content": "You are a code assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content

# Test both models
test_prompt = "Create a user service with database access and error handling"
base_output = test_model("gpt-4o-mini-2024-07-18", test_prompt)
finetuned_output = test_model("ft:gpt-4o-mini-2024-07-18:company:company-style-v1:AbC123", test_prompt)

print("BASE MODEL:")
print(base_output)
print("\n" + "="*60 + "\n")
print("FINE-TUNED MODEL:")
print(finetuned_output)

# Score both outputs
def score_patterns(code):
    score = 0
    if "your_internal_package" in code: score += 30
    if "log." in code or "logger." in code: score += 20
    if "try:" in code and "except" in code: score += 20
    # More snake_case identifiers than camelCase ones
    snake = len(re.findall(r'\b[a-z]+(?:_[a-z0-9]+)+\b', code))
    camel = len(re.findall(r'\b[a-z]+(?:[A-Z][a-z0-9]*)+\b', code))
    if snake > camel: score += 15
    if "# " in code: score += 15  # Has comments
    return score

base_score = score_patterns(base_output)
tuned_score = score_patterns(finetuned_output)

print(f"\nBase model score: {base_score}/100")
print(f"Fine-tuned score: {tuned_score}/100")
print(f"Improvement: {tuned_score - base_score} points")
Expected: Fine-tuned model scores 60-80+ points, base model scores 20-40 points. If improvement is < 20 points, you need more training examples or higher quality data.
Verification
Test in production workflow:
# Integrate into your IDE/CLI tool
response = openai.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:company:company-style-v1:AbC123",
    messages=[
        {"role": "system", "content": "You are a code assistant following our company style guide."},
        {"role": "user", "content": "Write a payment processing service"}
    ]
)
You should see: Generated code that requires minimal editing, uses your internal libraries, follows your error handling patterns, and matches your naming conventions.
Measure success (the figures below are illustrative targets; track your own numbers):
- Before fine-tuning: developers spend roughly 40% of their time editing AI-generated code
- After fine-tuning: editing time drops to 10-15%
- Break-even: if your team generates 200+ snippets/month, you should recoup training costs in 2-3 months
What You Learned
- Fine-tuning works best with 100+ consistent examples of your patterns
- GPT-4o mini costs ~$15-30 to train on 150 examples; Claude fine-tuning requires a sales contact
- Testing against base model quantifies if fine-tuning improved output
- ROI depends on usage volume - low usage teams should use better prompting instead
Limitations:
- Model still needs good prompts - fine-tuning isn't magic
- Patterns drift as your codebase evolves, retrain every 6 months
- Doesn't learn from feedback during generation (that requires RLHF)
Tested with OpenAI GPT-4o mini (Feb 2026), Python 3.11, openai==1.12.0