Generate Realistic Mock Data with AI in 12 Minutes

Create production-quality JSON datasets using Claude API and local LLMs. Skip faker.js limitations with context-aware mock data.

Problem: Faker.js Creates Obvious Fake Data

Your test data looks like "John Doe" with email "test123@example.com" and everyone notices it's fake. You need realistic, context-aware mock data that passes human review.

You'll learn:

  • Generate domain-specific JSON with Claude API
  • Create consistent mock data across related fields
  • Build reusable templates for common data types

Time: 12 min | Level: Intermediate


Why This Happens

Traditional faker libraries generate random data without context. A faker-generated profile might have the name "Sarah Chen" but the email "bob.smith@company.com", breaking realism. AI models understand the semantic relationships between fields.

Common symptoms:

  • Names don't match email addresses
  • Addresses conflict with phone area codes
  • Job titles don't match company industries
  • Demo data immediately identified as fake
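The mismatch is easy to reproduce: when each field is drawn independently, nothing ties them together. A minimal stdlib sketch (the names and emails here are purely illustrative):

```python
import random

# Independent pools -- nothing links a name to its email.
names = ["Sarah Chen", "Bob Smith", "Priya Patel"]
emails = ["sarah.chen@company.com", "bob.smith@company.com", "priya.patel@company.com"]

profile = {"name": random.choice(names), "email": random.choice(emails)}
# The two fields are sampled separately, so mismatched pairs like
# {"name": "Sarah Chen", "email": "bob.smith@company.com"} are common.
```

Because the model generates a whole record at once, it can keep these fields coherent instead.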

Solution

Step 1: Set Up Your Environment

Install the Anthropic SDK and python-dotenv. Works with Claude Sonnet 4.5 or local models via Ollama.

pip install anthropic python-dotenv --break-system-packages

Expected: Clean installation with no dependency conflicts.

If it fails:

  • Error: "externally-managed-environment": Use the --break-system-packages flag shown above
  • Still failing: Create a venv with python -m venv .venv && source .venv/bin/activate

Step 2: Create the Generator

Build a reusable mock data generator with proper error handling.

import anthropic
import json
import os
from typing import Dict, List, Any

from dotenv import load_dotenv

load_dotenv()  # load ANTHROPIC_API_KEY from a .env file if present

class MockDataGenerator:
    def __init__(self, api_key: str | None = None):
        # Use provided key or fall back to environment variable
        self.client = anthropic.Anthropic(
            api_key=api_key or os.getenv("ANTHROPIC_API_KEY")
        )
        
    def generate(self, schema: Dict[str, Any], count: int = 10) -> List[Dict]:
        """
        Generate mock data matching the schema.
        
        Args:
            schema: Field definitions with types and constraints
            count: Number of records to generate
            
        Returns:
            List of mock data dictionaries
        """
        prompt = self._build_prompt(schema, count)
        
        response = self.client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=4000,
            messages=[{
                "role": "user",
                "content": prompt
            }]
        )
        
        # Extract JSON from response
        content = response.content[0].text
        
        # Handle markdown code blocks if present
        if "```json" in content:
            content = content.split("```json")[1].split("```")[0]
        elif "```" in content:
            content = content.split("```")[1].split("```")[0]
            
        return json.loads(content.strip())
    
    def _build_prompt(self, schema: Dict[str, Any], count: int) -> str:
        """Build a detailed prompt for the AI model."""
        schema_desc = json.dumps(schema, indent=2)
        
        return f"""Generate {count} realistic mock data records as a JSON array.

Schema requirements:
{schema_desc}

Rules:
- Ensure semantic consistency (names match emails, locations match phone codes)
- Use realistic values appropriate to the field context
- Vary the data naturally - avoid repetitive patterns
- Return ONLY valid JSON, no explanations or markdown

Output format: [{{...}}, {{...}}, ...]"""

    def save(self, data: List[Dict], filename: str) -> None:
        """Save generated data to a JSON file."""
        with open(filename, 'w') as f:
            json.dump(data, f, indent=2)
        print(f"Saved {len(data)} records to {filename}")

Why this works: The prompt instructs the model to maintain consistency across related fields. The schema provides structure while allowing natural variation.
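Model output occasionally fails to parse even after the markdown stripping. A small retry wrapper (a sketch, separate from the class above) keeps one transient bad response from crashing a batch run:

```python
import json

def generate_with_retry(generator, schema, count=10, retries=3):
    """Retry generation when the model returns unparseable JSON."""
    last_error = None
    for attempt in range(retries):
        try:
            return generator.generate(schema, count)
        except json.JSONDecodeError as e:
            last_error = e  # bad output this attempt; try again
    raise RuntimeError(f"No valid JSON after {retries} attempts") from last_error
```

Usage: `generate_with_retry(MockDataGenerator(), employee_schema, count=20)` behaves like `generate()` but tolerates up to two bad responses.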


Step 3: Define Your Schema

Create schemas for your specific use case. Here's an employee directory example.

# Define schema with constraints
employee_schema = {
    "fields": {
        "id": "UUID v4 format",
        "firstName": "Common first name matching ethnicity",
        "lastName": "Common last name matching ethnicity",
        "email": "Corporate email: firstname.lastname@company.com (lowercase)",
        "phone": "US format: (XXX) XXX-XXXX",
        "department": "One of: Engineering, Sales, Marketing, HR, Finance",
        "title": "Job title appropriate to department",
        "location": "US city with state code",
        "salary": "Integer between 50000-200000, realistic for title",
        "hireDate": "Date between 2018-01-01 and 2025-12-31, ISO format",
        "isActive": "Boolean, 90% true"
    },
    "consistency_rules": [
        "Email must be derived from firstName and lastName",
        "Title must be appropriate for department",
        "Salary must match seniority implied by title",
        "Phone area code should match location when possible"
    ]
}

# Generate data
generator = MockDataGenerator()
employees = generator.generate(employee_schema, count=20)
generator.save(employees, "employees.json")

Expected: A JSON file with 20 employee records where names, emails, and roles make sense together.
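Before trusting the output, it's worth a quick structural check that every record carries exactly the fields the schema asked for. A hypothetical helper (validate_records is not part of the generator above):

```python
def validate_records(records, schema):
    """Return indices of records whose keys don't match the schema's fields."""
    expected = set(schema["fields"])
    return [i for i, record in enumerate(records) if set(record) != expected]

# Usage: bad = validate_records(employees, employee_schema)
# An empty list means every record has exactly the expected keys.
```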


Step 4: Add Domain-Specific Templates

Build templates for common scenarios you need frequently.

# E-commerce products
product_schema = {
    "fields": {
        "id": "UUID v4",
        "name": "Creative product name matching category",
        "category": "One of: Electronics, Clothing, Home, Sports, Books",
        "price": "Float between 9.99-999.99, realistic for category",
        "description": "2-3 sentences describing the product naturally",
        "sku": "Format: CAT-XXXXX where CAT is 3-letter category code",
        "inStock": "Boolean, 80% true",
        "rating": "Float between 3.5-5.0",
        "reviewCount": "Integer between 0-5000"
    }
}

# User profiles with social data
user_schema = {
    "fields": {
        "id": "UUID v4",
        "username": "Lowercase, 6-15 chars, no special chars except underscore",
        "displayName": "Real-looking full name",
        "bio": "1-2 sentence user bio reflecting interests",
        "avatar": "URL format: https://i.pravatar.cc/150?img=[1-70]",
        "followers": "Integer between 0-10000",
        "following": "Integer between 0-1000",
        "posts": "Integer between 0-500",
        "verified": "Boolean, 10% true",
        "joinDate": "ISO date between 2020-01-01 and 2026-01-01"
    },
    "consistency_rules": [
        "displayName and username should feel related but not identical",
        "High follower counts should correlate with verified status",
        "Bio should reflect a coherent interest or profession"
    ]
}

# Generate both datasets
products = generator.generate(product_schema, count=30)
users = generator.generate(user_schema, count=50)

generator.save(products, "products.json")
generator.save(users, "users.json")

Why multiple schemas: Different domains need different consistency rules. Products need category-price alignment; users need bio-username coherence.
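Format strings like the sku rule can also be enforced mechanically. Assuming X stands for a digit (the schema doesn't say), a regex check might look like:

```python
import re

SKU_RE = re.compile(r"^[A-Z]{3}-\d{5}$")  # CAT-XXXXX, assuming X = digit

def invalid_skus(products):
    """Return SKUs that don't match the CAT-XXXXX pattern."""
    return [p["sku"] for p in products if not SKU_RE.fullmatch(p["sku"])]
```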

If it fails:

  • Error: "Invalid JSON": Model sometimes adds explanations. The code handles this by stripping markdown blocks.
  • Data looks random: Strengthen your consistency_rules section with more specific requirements.
  • Rate limited: Add a small delay between calls: import time; time.sleep(1) after each generate().
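The rate-limit fix can be folded into a batching helper so large datasets are generated in smaller calls with a pause between them (a sketch; the batch size and delay are arbitrary defaults):

```python
import time

def generate_in_batches(generator, schema, total, batch_size=20, delay=1.0):
    """Generate `total` records in smaller batches, pausing between calls."""
    records = []
    while len(records) < total:
        remaining = total - len(records)
        records.extend(generator.generate(schema, min(batch_size, remaining)))
        if len(records) < total:
            time.sleep(delay)  # stay under the rate limit
    return records[:total]
```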

Verification

Test that generated data meets your requirements.

# Validate email format matches name
def test_email_consistency(employees):
    for emp in employees:
        expected_prefix = f"{emp['firstName']}.{emp['lastName']}".lower()
        actual_prefix = emp['email'].split('@')[0]
        assert expected_prefix == actual_prefix, \
            f"Email {emp['email']} doesn't match name {emp['firstName']} {emp['lastName']}"
    print("✓ All emails match names")

# Check salary ranges by title
def test_salary_realism(employees):
    senior_roles = ['Director', 'VP', 'Senior', 'Lead', 'Principal']
    for emp in employees:
        is_senior = any(role in emp['title'] for role in senior_roles)
        if is_senior:
            assert emp['salary'] > 100000, \
                f"{emp['title']} salary {emp['salary']} too low"
    print("✓ Salaries appropriate for titles")

# Run tests
test_email_consistency(employees)
test_salary_realism(employees)

You should see: Both checks pass, confirming data quality.


Advanced: Using Local Models

Save costs with Ollama and Llama 3.1 for bulk generation.

import requests

class LocalMockDataGenerator(MockDataGenerator):
    def __init__(self, model: str = "llama3.1:8b"):
        # Override to use local Ollama instead of API
        self.model = model
        self.base_url = "http://localhost:11434"
    
    def generate(self, schema: Dict[str, Any], count: int = 10) -> List[Dict]:
        prompt = self._build_prompt(schema, count)
        
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={
                "model": self.model,
                "prompt": prompt,
                "stream": False,
                "options": {
                    "temperature": 0.7  # Some randomness for variety
                }
            }
        )
        
        content = response.json()["response"]
        
        # Same JSON extraction logic
        if "```json" in content:
            content = content.split("```json")[1].split("```")[0]
        elif "```" in content:
            content = content.split("```")[1].split("```")[0]
            
        return json.loads(content.strip())

# Use local model
local_gen = LocalMockDataGenerator()
local_employees = local_gen.generate(employee_schema, count=100)

Setup: Install Ollama from ollama.ai, then run ollama pull llama3.1:8b before using this code.
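Before pointing the generator at Ollama, a quick reachability check avoids confusing connection errors. The /api/tags endpoint lists installed models; this stdlib-only sketch just confirms the server answers:

```python
from urllib.request import urlopen
from urllib.error import URLError

def ollama_available(base_url="http://localhost:11434"):
    """Return True if an Ollama server responds at base_url."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=2) as response:
            return response.status == 200
    except (URLError, OSError):
        return False
```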

Trade-offs: Local models are free but may need stronger prompts for consistency. Claude API is more reliable but costs ~$0.02 per 100 records.


What You Learned

  • AI models understand semantic relationships faker libraries miss
  • Explicit consistency rules prevent mismatched field values
  • Schema-based generation works for any JSON structure

Limitations:

  • API costs scale with data volume - use local models for >10k records
  • Complex business logic needs post-processing validation
  • Requires internet connection unless using local models

When NOT to use this:

  • Simple randomization (faker.js is faster)
  • Privacy-sensitive data (use anonymization instead)
  • Performance-critical hot paths (pre-generate and cache)

Tested with Claude Sonnet 4.5, Python 3.12, Ollama 0.1.23, macOS & Ubuntu