Fine-tuning with synthetic data generated by GPT-4o is now the fastest way to teach a smaller model a specialized skill — without spending weeks on manual labeling. This guide walks you through generating a structured JSONL dataset, validating it, and submitting it for fine-tuning via the OpenAI API, all in Python 3.12.
You'll learn:
- How to prompt GPT-4o to produce consistent, schema-valid training examples
- How to format and validate a JSONL dataset for gpt-4o-mini fine-tuning
- How to upload, start, and monitor a fine-tuning job programmatically
Time: 25 min | Difficulty: Intermediate
Why Synthetic Data Works for Fine-Tuning
Manual labeling is the bottleneck in most fine-tuning projects. A single human annotator produces ~200 examples per day. GPT-4o produces 200 in under 60 seconds at roughly $0.30 in API costs (at $2.50/1M input tokens, $10/1M output tokens as of March 2026).
The tradeoff: synthetic data inherits GPT-4o's biases. It works best when:
- The task is well-defined — classification, extraction, formatting, Q&A with a narrow domain
- You have at least a few seed examples to anchor the generator's style
- The fine-tuned model needs to replicate a behavior, not discover new knowledge
It works poorly for tasks requiring real-world grounding (medical diagnosis, legal judgment calls) or tasks where GPT-4o itself has a high error rate.
Pipeline: seed examples → GPT-4o generator → JSONL validator → OpenAI fine-tuning API → specialized model
Prerequisites
- Python 3.12+
- openai>=1.30.0 — install with uv add openai or pip install openai
- An OpenAI API key with fine-tuning access (Tier 1+, starts at $5 prepaid credit)
- ~500 MB disk for dataset files
# Verify your setup
python --version # 3.12.x
pip show openai | grep Version # 1.30.0 or higher
Step 1: Define Your Task Schema
The most important decision is what each training example looks like. The OpenAI fine-tuning API expects JSONL where each line is a messages array — the same format as a chat completion.
For this guide, the task is intent classification: given a user message from a customer support chat, classify the intent into one of five categories.
# schema.py
from pydantic import BaseModel
from typing import Literal
INTENTS = Literal[
"billing_issue",
"technical_support",
"account_access",
"feature_request",
"general_inquiry",
]
class TrainingExample(BaseModel):
user_message: str
intent: INTENTS
reasoning: str # Chain-of-thought — stripped before upload, improves generator quality
Store your seed examples in seeds.json — 5 to 10 is enough:
[
{
"user_message": "I was charged twice this month",
"intent": "billing_issue",
"reasoning": "User explicitly mentions a charge, indicating a billing problem."
},
{
"user_message": "The app crashes when I open settings",
"intent": "technical_support",
"reasoning": "Describes a software malfunction, classic technical support signal."
},
{
"user_message": "Can I export my data to CSV?",
"intent": "feature_request",
"reasoning": "User is asking for a capability that may not exist yet."
}
]
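Before pointing the generator at the seed file, it's worth a quick sanity check that every seed has the required fields and a known intent. A minimal stdlib-only sketch (hypothetical file name check_seeds.py; the intent set mirrors schema.py above):

```python
# check_seeds.py — quick structural check on seeds.json before generation.
# Assumes the seeds.json layout shown above; intent set mirrors schema.py.
import json

VALID_INTENTS = {
    "billing_issue", "technical_support", "account_access",
    "feature_request", "general_inquiry",
}

def check_seeds(raw: str) -> list[str]:
    """Return a list of problems found in the seed JSON (empty list = OK)."""
    problems = []
    for i, seed in enumerate(json.loads(raw)):
        missing = {"user_message", "intent", "reasoning"} - seed.keys()
        if missing:
            problems.append(f"seed {i}: missing fields {sorted(missing)}")
        if seed.get("intent") not in VALID_INTENTS:
            problems.append(f"seed {i}: unknown intent {seed.get('intent')!r}")
    return problems

if __name__ == "__main__":
    from pathlib import Path
    for problem in check_seeds(Path("seeds.json").read_text()):
        print(problem)
```

Catching a typo'd intent here is much cheaper than discovering it after the generator has anchored 300 examples on it.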
Step 2: Build the Generator
This script calls GPT-4o in a loop, uses structured output (response_format) to enforce schema compliance, and writes valid examples to a JSONL file.
# generate_dataset.py
import json
import random
from pathlib import Path
from openai import OpenAI
from pydantic import ValidationError
from schema import TrainingExample, INTENTS
client = OpenAI() # reads OPENAI_API_KEY from env
SYSTEM_PROMPT = """You are a dataset generator for a customer support intent classifier.
Generate realistic customer support messages and classify them correctly.
Vary message length, tone (frustrated, polite, confused), and phrasing.
Avoid repeating messages from the seed examples."""
def load_seeds(path: str = "seeds.json") -> list[dict]:
return json.loads(Path(path).read_text())
def generate_batch(seeds: list[dict], batch_size: int = 10) -> list[TrainingExample]:
seed_text = json.dumps(random.sample(seeds, min(3, len(seeds))), indent=2)
response = client.chat.completions.create(
model="gpt-4o-2024-08-06",
response_format={
"type": "json_schema",
"json_schema": {
"name": "training_batch",
"schema": {
"type": "object",
"properties": {
"examples": {
"type": "array",
"items": TrainingExample.model_json_schema(),
"minItems": batch_size,
"maxItems": batch_size,
}
},
"required": ["examples"],
},
},
},
messages=[
{"role": "system", "content": SYSTEM_PROMPT},
{
"role": "user",
"content": (
f"Generate exactly {batch_size} diverse training examples.\n"
f"Reference seeds for style only — do not copy them:\n{seed_text}"
),
},
],
temperature=0.9, # high temp = more lexical diversity across batches
)
raw = json.loads(response.choices[0].message.content)
validated = []
for item in raw["examples"]:
try:
validated.append(TrainingExample(**item))
except ValidationError as e:
print(f"Skipping invalid example: {e}")
return validated
def to_finetune_line(example: TrainingExample) -> dict:
# Strip reasoning — it's only used to improve generator quality
return {
"messages": [
{
"role": "system",
"content": "Classify the customer support message intent.",
},
{"role": "user", "content": example.user_message},
{"role": "assistant", "content": example.intent},
]
}
def generate_dataset(total: int = 300, batch_size: int = 10, output: str = "dataset.jsonl"):
seeds = load_seeds()
out = Path(output)
out.write_text("") # clear file
written = 0
while written < total:
batch = generate_batch(seeds, batch_size)
with out.open("a") as f:
for example in batch:
f.write(json.dumps(to_finetune_line(example)) + "\n")
written += len(batch)
print(f"Progress: {written}/{total}")
print(f"Dataset written to {output} ({written} examples)")
if __name__ == "__main__":
generate_dataset(total=300)
Run it:
python generate_dataset.py
# Progress: 10/300
# Progress: 20/300
# ...
# Dataset written to dataset.jsonl (300 examples)
Expected cost: ~$0.40 for 300 examples at gpt-4o-2024-08-06 pricing (March 2026).
If it fails:
- openai.AuthenticationError → check that echo $OPENAI_API_KEY prints a key in your shell
- ValidationError spam → lower temperature to 0.7; GPT-4o sometimes drifts from schema at very high temps
- minItems not respected → the response_format JSON Schema constraint is advisory; the while written < total loop compensates
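One more failure mode on longer runs is rate limiting. A generic retry helper (hypothetical name with_backoff, not part of the scripts above or the OpenAI SDK) can wrap generate_batch so transient errors such as openai.RateLimitError don't kill the loop:

```python
# backoff.py — generic exponential-backoff wrapper (hypothetical helper).
# Wrap generate_batch calls with it, passing openai.RateLimitError (and
# optionally openai.APIError) as the retryable exception types.
import time

def with_backoff(fn, *args, retries: int = 5, base_delay: float = 2.0,
                 retryable=(Exception,), sleep=time.sleep, **kwargs):
    """Call fn, retrying with delays of base_delay * 2**attempt on failure."""
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except retryable:
            if attempt == retries - 1:
                raise  # out of retries — surface the real error
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
```

In generate_dataset, replace the direct call with something like `batch = with_backoff(generate_batch, seeds, batch_size, retryable=(openai.RateLimitError,))`.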
Step 3: Validate the Dataset
OpenAI's fine-tuning API rejects malformed JSONL silently in some cases and with cryptic errors in others. Validate locally first.
# validate_dataset.py
import json
from pathlib import Path
REQUIRED_ROLES = {"system", "user", "assistant"}
def validate(path: str = "dataset.jsonl") -> bool:
lines = Path(path).read_text().strip().splitlines()
errors = 0
for i, line in enumerate(lines, 1):  # 1-based so printed numbers match the file
try:
obj = json.loads(line)
except json.JSONDecodeError as e:
print(f"Line {i}: invalid JSON — {e}")
errors += 1
continue
msgs = obj.get("messages", [])
roles = {m.get("role") for m in msgs}
if not {"user", "assistant"}.issubset(roles):
print(f"Line {i}: missing user or assistant turn")
errors += 1
for m in msgs:
if not (m.get("content") or "").strip():  # also catches null content
print(f"Line {i}: empty content in role '{m.get('role')}'")
errors += 1
print(f"Validated {len(lines)} lines — {errors} errors found")
return errors == 0
if __name__ == "__main__":
validate()
python validate_dataset.py
# Validated 300 lines — 0 errors found
OpenAI recommends a minimum of 10 examples; 50–100 produces noticeable improvement; 300+ gives stable results for classification tasks.
Step 4: Split Train / Validation
Fine-tuning needs a separate validation file to track loss during training. An 80/20 split is standard.
# split_dataset.py
import json, random
from pathlib import Path
lines = Path("dataset.jsonl").read_text().strip().splitlines()
random.shuffle(lines)
split = int(len(lines) * 0.8)
Path("train.jsonl").write_text("\n".join(lines[:split]))
Path("val.jsonl").write_text("\n".join(lines[split:]))
print(f"Train: {split} | Val: {len(lines) - split}")
# Train: 240 | Val: 60
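A plain random shuffle can leave a rare intent underrepresented in the 60-line validation set. If that matters, a stratified variant (a sketch, hypothetical file name stratified_split.py, same JSONL layout as above, seeded for reproducibility) keeps per-intent proportions in both files:

```python
# stratified_split.py — 80/20 split that preserves per-intent proportions.
# Assumes the JSONL layout from to_finetune_line (assistant content = label).
import json
import random
from collections import defaultdict

def stratified_split(lines: list[str], ratio: float = 0.8, seed: int = 42):
    """Split JSONL lines into (train, val), splitting each label group separately."""
    by_label = defaultdict(list)
    for line in lines:
        msgs = json.loads(line)["messages"]
        label = next(m["content"] for m in msgs if m["role"] == "assistant")
        by_label[label].append(line)
    rng = random.Random(seed)  # fixed seed → reproducible split
    train, val = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(len(group) * ratio)
        train.extend(group[:cut])
        val.extend(group[cut:])
    return train, val
```

Write the two lists out with the same Path(...).write_text("\n".join(...)) pattern as above.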
Step 5: Upload and Start the Fine-Tuning Job
# finetune.py
from openai import OpenAI
from pathlib import Path
client = OpenAI()
def upload_file(path: str, purpose: str = "fine-tune") -> str:
with open(path, "rb") as f:
response = client.files.create(file=f, purpose=purpose)
print(f"Uploaded {path} → file ID: {response.id}")
return response.id
def start_finetune(train_id: str, val_id: str, suffix: str = "intent-v1") -> str:
job = client.fine_tuning.jobs.create(
training_file=train_id,
validation_file=val_id,
model="gpt-4o-mini-2024-07-18", # cheapest model that accepts chat fine-tuning
hyperparameters={
"n_epochs": 3, # 3 epochs is the OpenAI default; increase to 5 for small datasets
"batch_size": "auto", # auto = 0.2% of training set, min 1, max 256
"learning_rate_multiplier": "auto",
},
suffix=suffix,
)
print(f"Fine-tuning job started: {job.id}")
return job.id
def monitor(job_id: str):
import time
while True:
job = client.fine_tuning.jobs.retrieve(job_id)
print(f"Status: {job.status} | Model: {job.fine_tuned_model}")
if job.status in ("succeeded", "failed", "cancelled"):
break
time.sleep(30) # poll every 30s — OpenAI jobs typically take 10–30 min for small datasets
if __name__ == "__main__":
train_id = upload_file("train.jsonl")
val_id = upload_file("val.jsonl")
job_id = start_finetune(train_id, val_id)
monitor(job_id)
python finetune.py
# Uploaded train.jsonl → file ID: file-abc123
# Uploaded val.jsonl → file ID: file-def456
# Fine-tuning job started: ftjob-xyz789
# Status: validating_files | Model: None
# Status: running | Model: None
# ...
# Status: succeeded | Model: ft:gpt-4o-mini-2024-07-18:org:intent-v1:abc123
Expected cost for 300-example fine-tuning job: ~$0.60–$1.20 (at $0.003/1K training tokens, 3 epochs).
If it fails:
- status: failed → call client.fine_tuning.jobs.list_events(job_id) for the error log
- High validation loss → add more diverse seeds, reduce temperature, or increase n_epochs to 5
- InvalidRequestError: training file too small → add more examples; minimum is 10, but 50 is practical
Step 6: Test the Fine-Tuned Model
# test_model.py
from openai import OpenAI
client = OpenAI()
MODEL = "ft:gpt-4o-mini-2024-07-18:org:intent-v1:abc123" # replace with your model ID
test_messages = [
"My invoice shows a charge I don't recognize",
"The login button doesn't do anything on Firefox",
"Would it be possible to add dark mode?",
]
for msg in test_messages:
response = client.chat.completions.create(
model=MODEL,
messages=[
{"role": "system", "content": "Classify the customer support message intent."},
{"role": "user", "content": msg},
],
max_tokens=20,
temperature=0, # deterministic output for classification
)
intent = response.choices[0].message.content.strip()
print(f"{msg!r} → {intent}")
python test_model.py
# 'My invoice shows a charge I don't recognize' → billing_issue
# 'The login button doesn't do anything on Firefox' → technical_support
# 'Would it be possible to add dark mode?' → feature_request
You should see: correct intent labels with zero latency degradation vs. base gpt-4o-mini. The fine-tuned model skips the multi-sentence explanation the base model would produce and returns only the label — exactly the behavior the training data enforced.
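Three spot checks aren't an evaluation. To get an accuracy number, run the fine-tuned model over val.jsonl and compare against the labels. A sketch (hypothetical file name evaluate.py; parse_pairs and accuracy are pure functions, and you plug in the chat call from test_model.py as the classify argument):

```python
# evaluate.py — measure accuracy of the fine-tuned model on val.jsonl.
# parse_pairs and accuracy are pure; pass the real chat call from
# test_model.py as `classify` to run the actual evaluation.
import json

def parse_pairs(jsonl_text: str) -> list[tuple[str, str]]:
    """(user_message, intent) pairs from the fine-tuning JSONL layout above."""
    pairs = []
    for line in jsonl_text.strip().splitlines():
        msgs = json.loads(line)["messages"]
        user = next(m["content"] for m in msgs if m["role"] == "user")
        label = next(m["content"] for m in msgs if m["role"] == "assistant")
        pairs.append((user, label))
    return pairs

def accuracy(pairs: list[tuple[str, str]], classify) -> float:
    """Fraction of pairs where classify(user_message) matches the label."""
    hits = sum(1 for msg, label in pairs if classify(msg) == label)
    return hits / len(pairs)
```

For the 60-line validation set this costs fractions of a cent and gives you a number to compare against a few-shot baseline on the base model.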
Dataset Size vs. Quality: When to Stop Generating
| Examples | Expected accuracy lift | Fine-tune cost (USD) |
|---|---|---|
| 50 | Noticeable for narrow tasks | ~$0.15 |
| 100–200 | Solid for 3–5 class classification | ~$0.30–$0.60 |
| 300–500 | Production-stable | ~$0.90–$1.50 |
| 1,000+ | Diminishing returns unless task is complex | $3.00+ |
Stop generating when your validation loss plateaus across epochs — not at an arbitrary example count. Check the OpenAI fine-tuning dashboard or poll client.fine_tuning.jobs.list_events(job_id) for per-epoch loss values.
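Pulling those per-step losses out of the event stream can be sketched like this (hypothetical helper extract_losses; the metric events carry a data payload with step and train_loss fields in the SDK versions tested here — verify the field names against your version):

```python
# loss_from_events.py — pull per-step training loss out of fine-tuning events.
# Metric events from client.fine_tuning.jobs.list_events carry a `data`
# payload with step/train_loss (field names as observed; verify for your
# SDK version). extract_losses is pure so it works on dicts or objects.
def extract_losses(events: list) -> list[tuple[int, float]]:
    """Return (step, train_loss) pairs sorted by step."""
    out = []
    for ev in events:
        data = ev.get("data") if isinstance(ev, dict) else getattr(ev, "data", None)
        if data and "train_loss" in data:
            out.append((data.get("step"), data["train_loss"]))
    return sorted(out)
```

Usage would look like `extract_losses(client.fine_tuning.jobs.list_events(job_id, limit=100).data)`; if the tail of the list is flat, more data is unlikely to help.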
What You Learned
- GPT-4o's structured output (response_format with JSON Schema) is the reliable way to generate schema-valid training examples at scale — plain prompting produces too many malformed outputs to be practical
- Reasoning fields (reasoning in the schema) improve generator diversity even when you strip them before upload — they act as an internal chain-of-thought that steers GPT-4o toward better examples
- temperature=0 at inference time is essential for classification fine-tunes — the model has learned to output a specific token and any randomness degrades accuracy
Tested on Python 3.12.3, openai 1.35.0, gpt-4o-2024-08-06 generator, gpt-4o-mini-2024-07-18 fine-tuning target. macOS 15 and Ubuntu 24.04.
FAQ
Q: How many synthetic examples do I need before fine-tuning outperforms few-shot prompting?
A: For classification with 3–5 classes, fine-tuning on 100+ synthetic examples typically beats a 10-shot prompt. For tasks with 10+ classes or nuanced outputs, aim for 300+.
Q: Does GPT-4o-generated data introduce hallucination bias into the fine-tuned model?
A: Yes: if GPT-4o hallucinates on your task, so will your dataset. Mitigate it by reviewing a 10% random sample before uploading and filtering out low-confidence examples with a second GPT-4o pass.
Q: Can I use this pipeline with open-source models like Llama 3.3 or Mistral instead of gpt-4o-mini?
A: Yes. The JSONL format is compatible with most fine-tuning frameworks (Axolotl, Unsloth, TRL). Replace the upload/finetune steps with your framework's training script.
Q: What is the minimum OpenAI account tier needed for fine-tuning?
A: Tier 1 (at least $5 in API usage). Fine-tuning on gpt-4o-mini is available immediately at Tier 1. Fine-tuning on gpt-4o requires Tier 4 ($250+ usage).
Q: Does OpenAI charge for storing uploaded training files?
A: Uploaded files are free to store but expire after 30 days unless you set expires_at. Delete them with client.files.delete(file_id) after training to stay clean.