Problem: Faker.js Creates Obvious Fake Data
Your test data looks like "John Doe" with email "test123@example.com" and everyone notices it's fake. You need realistic, context-aware mock data that passes human review.
You'll learn:
- Generate domain-specific JSON with Claude API
- Create consistent mock data across related fields
- Build reusable templates for common data types
Time: 12 min | Level: Intermediate
Why This Happens
Traditional faker libraries generate each field independently, with no context. A faker-built profile might pair the name "Sarah Chen" with the email "bob.smith@company.com", breaking realism. AI models, by contrast, understand the semantic relationships between fields.
Common symptoms:
- Names don't match email addresses
- Addresses conflict with phone area codes
- Job titles don't match company industries
- Demo data immediately identified as fake
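These symptoms all trace back to independent sampling. A minimal sketch of how faker-style generation produces them (the name and email pools here are made up for illustration):

```python
import random

FIRST_NAMES = ["Sarah", "Bob", "Priya"]
LAST_NAMES = ["Chen", "Smith", "Patel"]
EMAILS = ["bob.smith@company.com", "sarah.chen@company.com", "a.patel@company.com"]

def fakerlike_profile(rng: random.Random) -> dict:
    # Each field is drawn independently -- the email draw has no knowledge
    # of the name that was just drawn, so they routinely mismatch.
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "email": rng.choice(EMAILS),
    }

rng = random.Random(42)
profiles = [fakerlike_profile(rng) for _ in range(5)]
mismatches = [
    p for p in profiles
    if not p["email"].startswith(p["name"].split()[0].lower())
]
print(f"{len(mismatches)} of {len(profiles)} profiles have mismatched emails")
```

Because the two `choice` calls never see each other, consistency is only ever accidental.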
Solution
Step 1: Set Up Your Environment
Install the Anthropic SDK. Works with Claude Sonnet 4.5 or local models via Ollama.
```shell
pip install anthropic python-dotenv --break-system-packages
```
Expected: Clean installation with no dependency conflicts.
If it fails:
- Error "externally-managed-environment": use the `--break-system-packages` flag shown above.
- Still failing: create a venv with `python -m venv .venv && source .venv/bin/activate` and install inside it.
Step 2: Create the Generator
Build a reusable mock data generator with proper error handling.
````python
import anthropic
import json
import os
from typing import Any, Dict, List, Optional


class MockDataGenerator:
    def __init__(self, api_key: Optional[str] = None):
        # Use the provided key or fall back to the environment variable
        self.client = anthropic.Anthropic(
            api_key=api_key or os.getenv("ANTHROPIC_API_KEY")
        )

    def generate(self, schema: Dict[str, Any], count: int = 10) -> List[Dict]:
        """
        Generate mock data matching the schema.

        Args:
            schema: Field definitions with types and constraints
            count: Number of records to generate

        Returns:
            List of mock data dictionaries
        """
        prompt = self._build_prompt(schema, count)
        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4000,
            messages=[{"role": "user", "content": prompt}]
        )

        # Extract JSON from the response
        content = response.content[0].text

        # Handle markdown code blocks if present
        if "```json" in content:
            content = content.split("```json")[1].split("```")[0]
        elif "```" in content:
            content = content.split("```")[1].split("```")[0]

        return json.loads(content.strip())

    def _build_prompt(self, schema: Dict[str, Any], count: int) -> str:
        """Build a detailed prompt for the AI model."""
        schema_desc = json.dumps(schema, indent=2)
        return f"""Generate {count} realistic mock data records as a JSON array.

Schema requirements:
{schema_desc}

Rules:
- Ensure semantic consistency (names match emails, locations match phone codes)
- Use realistic values appropriate to the field context
- Vary the data naturally - avoid repetitive patterns
- Return ONLY valid JSON, no explanations or markdown

Output format: [{{...}}, {{...}}, ...]"""

    def save(self, data: List[Dict], filename: str) -> None:
        """Save generated data to a JSON file."""
        with open(filename, "w") as f:
            json.dump(data, f, indent=2)
        print(f"Saved {len(data)} records to {filename}")
````
Why this works: The prompt instructs the model to maintain consistency across related fields. The schema provides structure while allowing natural variation.
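The markdown-stripping step is easy to get subtly wrong, so it can help to pull it out into a standalone, testable helper. A sketch mirroring the extraction logic above (the function name `extract_json` is hypothetical):

````python
import json

def extract_json(content: str) -> str:
    """Strip an optional markdown code fence from a model response."""
    if "```json" in content:
        content = content.split("```json")[1].split("```")[0]
    elif "```" in content:
        content = content.split("```")[1].split("```")[0]
    return content.strip()

# Fenced and bare responses both reduce to the same payload
fenced = "```json\n[{\"id\": 1}]\n```"
bare = "[{\"id\": 1}]"
assert json.loads(extract_json(fenced)) == json.loads(extract_json(bare))
````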
Step 3: Define Your Schema
Create schemas for your specific use case. Here's an employee directory example.
```python
# Define a schema with constraints
employee_schema = {
    "fields": {
        "id": "UUID v4 format",
        "firstName": "Common first name",
        "lastName": "Common last name that pairs plausibly with the first name",
        "email": "Corporate email: firstname.lastname@company.com (lowercase)",
        "phone": "US format: (XXX) XXX-XXXX",
        "department": "One of: Engineering, Sales, Marketing, HR, Finance",
        "title": "Job title appropriate to department",
        "location": "US city with state code",
        "salary": "Integer between 50000-200000, realistic for title",
        "hireDate": "Date between 2018-01-01 and 2025-12-31, ISO format",
        "isActive": "Boolean, 90% true"
    },
    "consistency_rules": [
        "Email must be derived from firstName and lastName",
        "Title must be appropriate for department",
        "Salary must match seniority implied by title",
        "Phone area code should match location when possible"
    ]
}

# Generate data
generator = MockDataGenerator()
employees = generator.generate(employee_schema, count=20)
generator.save(employees, "employees.json")
```
Expected: A JSON file with 20 employee records where names, emails, and roles make sense together.
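Before trusting the output, a quick structural check is cheap. A sketch assuming the employee schema above (the field names are the ones the schema declares; `check_fields` is a hypothetical helper):

```python
REQUIRED_FIELDS = {
    "id", "firstName", "lastName", "email", "phone",
    "department", "title", "location", "salary", "hireDate", "isActive",
}

def check_fields(records: list) -> int:
    # Fail fast if any record is missing a field the schema promised
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - rec.keys()
        assert not missing, f"Record {i} is missing fields: {sorted(missing)}"
    return len(records)
```

Run `check_fields(employees)` right after generation; a missing field usually means the model truncated its output, so lower `count` or raise `max_tokens`.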
Step 4: Add Domain-Specific Templates
Build templates for common scenarios you need frequently.
```python
# E-commerce products
product_schema = {
    "fields": {
        "id": "UUID v4",
        "name": "Creative product name matching category",
        "category": "One of: Electronics, Clothing, Home, Sports, Books",
        "price": "Float between 9.99-999.99, realistic for category",
        "description": "2-3 sentences describing the product naturally",
        "sku": "Format: CAT-XXXXX where CAT is 3-letter category code",
        "inStock": "Boolean, 80% true",
        "rating": "Float between 3.5-5.0",
        "reviewCount": "Integer between 0-5000"
    }
}

# User profiles with social data
user_schema = {
    "fields": {
        "id": "UUID v4",
        "username": "Lowercase, 6-15 chars, no special chars except underscore",
        "displayName": "Real-looking full name",
        "bio": "1-2 sentence user bio reflecting interests",
        "avatar": "URL format: https://i.pravatar.cc/150?img=[1-70]",
        "followers": "Integer between 0-10000",
        "following": "Integer between 0-1000",
        "posts": "Integer between 0-500",
        "verified": "Boolean, 10% true",
        "joinDate": "ISO date between 2020-01-01 and 2026-01-01"
    },
    "consistency_rules": [
        "displayName and username should feel related but not identical",
        "High follower counts should correlate with verified status",
        "Bio should reflect a coherent interest or profession"
    ]
}

# Generate both datasets
products = generator.generate(product_schema, count=30)
users = generator.generate(user_schema, count=50)
generator.save(products, "products.json")
generator.save(users, "users.json")
```
Why multiple schemas: Different domains need different consistency rules. Products need category-price alignment; users need bio-username coherence.
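As templates accumulate, a registry keeps them discoverable. A hypothetical pattern, not part of the generator class above (the abbreviated schemas here are placeholders):

```python
# Hypothetical registry: look up schemas by name instead of passing dicts around
SCHEMA_TEMPLATES: dict = {
    "employee": {"fields": {"id": "UUID v4", "email": "Corporate email"}},
    "product": {"fields": {"id": "UUID v4", "price": "Float, realistic for category"}},
}

def get_schema(name: str) -> dict:
    try:
        return SCHEMA_TEMPLATES[name]
    except KeyError:
        known = ", ".join(sorted(SCHEMA_TEMPLATES))
        raise ValueError(f"Unknown template '{name}'. Known templates: {known}")
```

Callers then write `generator.generate(get_schema("product"), count=30)`, and a typo in the template name fails loudly instead of generating garbage.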
If it fails:
- Error: "Invalid JSON": Model sometimes adds explanations. The code handles this by stripping markdown blocks.
- Data looks random: Strengthen your consistency_rules section with more specific requirements.
- Rate limited: add a small delay between calls, e.g. `import time; time.sleep(1)` after each `generate()`.
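For the occasional malformed response, a retry wrapper is worth having. A sketch (the wrapper name and signature are made up, not part of the class above):

```python
import json
import time

def generate_with_retry(generate_fn, schema: dict, count: int,
                        retries: int = 3, delay: float = 1.0):
    # Retry only on malformed JSON; other errors propagate immediately
    for attempt in range(retries):
        try:
            return generate_fn(schema, count)
        except json.JSONDecodeError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```

Usage: `employees = generate_with_retry(generator.generate, employee_schema, 20)`.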
Verification
Test that generated data meets your requirements.
```python
# Validate that the email format matches the name
def test_email_consistency(employees):
    for emp in employees:
        expected_prefix = f"{emp['firstName']}.{emp['lastName']}".lower()
        actual_prefix = emp['email'].split('@')[0]
        assert expected_prefix == actual_prefix, \
            f"Email {emp['email']} doesn't match name {emp['firstName']} {emp['lastName']}"
    print("✓ All emails match names")

# Check salary ranges by title
def test_salary_realism(employees):
    senior_roles = ['Director', 'VP', 'Senior', 'Lead', 'Principal']
    for emp in employees:
        is_senior = any(role in emp['title'] for role in senior_roles)
        if is_senior:
            assert emp['salary'] > 100000, \
                f"{emp['title']} salary {emp['salary']} too low"
    print("✓ Salaries appropriate for titles")

# Run the tests
test_email_consistency(employees)
test_salary_realism(employees)
```
You should see: Both checks pass, confirming data quality.
Advanced: Using Local Models
Save costs with Ollama and Llama 3.1 for bulk generation.
````python
import requests

class LocalMockDataGenerator(MockDataGenerator):
    def __init__(self, model: str = "llama3.1:8b"):
        # Override to use a local Ollama server instead of the API
        self.model = model
        self.base_url = "http://localhost:11434"

    def generate(self, schema: Dict[str, Any], count: int = 10) -> List[Dict]:
        prompt = self._build_prompt(schema, count)
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={
                "model": self.model,
                "prompt": prompt,
                "stream": False,
                "options": {
                    "temperature": 0.7  # Some randomness for variety
                }
            },
            timeout=300  # local generation can be slow on CPU
        )
        response.raise_for_status()
        content = response.json()["response"]

        # Same JSON extraction logic as the parent class
        if "```json" in content:
            content = content.split("```json")[1].split("```")[0]
        elif "```" in content:
            content = content.split("```")[1].split("```")[0]

        return json.loads(content.strip())

# Use the local model
local_gen = LocalMockDataGenerator()
local_employees = local_gen.generate(employee_schema, count=100)
````
Setup: Install Ollama from ollama.ai, then run `ollama pull llama3.1:8b` before using this code.
Trade-offs: Local models are free but may need stronger prompts for consistency. Claude API is more reliable but costs ~$0.02 per 100 records.
What You Learned
- AI models understand semantic relationships faker libraries miss
- Explicit consistency rules prevent mismatched field values
- Schema-based generation works for any JSON structure
Limitations:
- API costs scale with data volume - use local models for >10k records
- Complex business logic needs post-processing validation
- Requires internet connection unless using local models
When NOT to use this:
- Simple randomization (faker.js is faster)
- Privacy-sensitive data (use anonymization instead)
- Performance-critical hot paths (pre-generate and cache)
Tested with Claude Sonnet 4.5, Python 3.12, Ollama 0.1.23, macOS & Ubuntu