Database Seeding: Populate SQL with Realistic Data Using AI

Create production-quality test data for relational databases in minutes using Claude's API - no manual CSV writing required.

You'll learn:

Generate complex relational data with foreign key constraints
Build a reusable seeding script with Claude API
Handle edge cases like circular dependencies and data consistency

Time: 20 min | Level: Intermediate

Problem: Test Data That Actually Makes Sense

You need realistic data for your SQL database - users, orders, products with proper relationships - but manual seeding is tedious and fake data libraries don't understand your schema.

Common symptoms:

Orphaned foreign keys breaking queries
Unrealistic data patterns (all orders on same date)
Time wasted writing INSERT statements
Can't test edge cases without custom data

Why AI-Powered Seeding Works

Traditional seeders use templates or random generation, but they don't understand relational constraints. Claude can read your schema and generate contextually correct data that respects foreign keys, business logic, and realistic patterns.

What makes this different:

Generates data that matches your domain (e-commerce, SaaS, healthcare)
Maintains referential integrity automatically
Creates edge cases on demand (high-value customers, canceled orders)
Adapts to schema changes without rewriting code

Solution

Step 1: Set Up the Seeding Script

Create a Node.js script that reads your schema and calls Claude:

// seed-db.js
import Anthropic from '@anthropic-ai/sdk';
import { createPool } from '@vercel/postgres';

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});

const db = createPool({
  connectionString: process.env.DATABASE_URL
});

async function generateSeedData(schema, rowCounts) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 4000,
    messages: [{
      role: 'user',
      content: `Generate SQL INSERT statements for this schema:

${schema}

Requirements:
- ${rowCounts.users} users
- ${rowCounts.products} products  
- ${rowCounts.orders} orders (realistic distribution across users)
- Maintain all foreign key constraints
- Use realistic data (names, emails, prices)
- Return ONLY valid SQL INSERT statements, no explanations`
    }]
  });
  
  return response.content[0].text;
}

Why this works: Claude sees the entire schema context, so it generates foreign keys that reference actual IDs it created earlier.

Step 2: Define Your Schema

const schema = `
CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  email VARCHAR(255) UNIQUE NOT NULL,
  name VARCHAR(100) NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  name VARCHAR(200) NOT NULL,
  price DECIMAL(10,2) NOT NULL,
  stock INT DEFAULT 0
);

CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  user_id INT REFERENCES users(id),
  total DECIMAL(10,2) NOT NULL,
  status VARCHAR(20) DEFAULT 'pending',
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE order_items (
  id SERIAL PRIMARY KEY,
  order_id INT REFERENCES orders(id),
  product_id INT REFERENCES products(id),
  quantity INT NOT NULL,
  price DECIMAL(10,2) NOT NULL
);
`;

const rowCounts = {
  users: 50,
  products: 20,
  orders: 100
};

Expected: Schema includes foreign keys - Claude will respect these when generating IDs.

Step 3: Execute the Seed

async function seed() {
  try {
    // Generate SQL from Claude
    console.log('Generating seed data...');
    const sql = await generateSeedData(schema, rowCounts);
    
    // Execute in transaction
    await db.query('BEGIN');
    await db.query(sql);
    await db.query('COMMIT');
    
    console.log('✓ Database seeded successfully');
  } catch (error) {
    await db.query('ROLLBACK');
    console.error('Seed failed:', error.message);
  } finally {
    await db.end();
  }
}

seed();

Run it:

node seed-db.js

If it fails:

Error: "duplicate key value": Add TRUNCATE TABLE users, orders, products CASCADE; before seeding
Error: "foreign key violation": Check table order - Claude might generate orders before users

Step 4: Handle Table Dependencies

Fix ordering issues by explicitly sequencing tables:

async function generateSeedData(schema, rowCounts) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 4000,
    messages: [{
      role: 'user',
      content: `Generate SQL INSERT statements in this exact order:
1. users table (${rowCounts.users} rows)
2. products table (${rowCounts.products} rows)
3. orders table (${rowCounts.orders} rows, reference user IDs 1-${rowCounts.users})
4. order_items table (2-5 items per order, reference existing order and product IDs)

Schema:
${schema}

Rules:
- Start IDs at 1 for each table
- orders.user_id must be between 1 and ${rowCounts.users}
- orders.total should match sum of order_items.price * quantity
- Return raw SQL only, no markdown`
    }]
  });
  
  return response.content[0].text;
}

Why this matters: Explicit ordering prevents FK violations. Claude follows instructions to reference valid IDs.

Step 5: Add Edge Cases

Generate specific test scenarios:

const edgeCasePrompt = `
Also include:
- 3 users with no orders (test null relationships)
- 1 high-value customer with 10+ orders
- 2 products with 0 stock (test inventory logic)
- 5 cancelled orders (status = 'cancelled')
`;

// Append to existing prompt
content: `${basePrompt}\n\n${edgeCasePrompt}`

Expected: SQL includes outliers your tests need, not just average-case data.

Verification

# Check row counts
psql $DATABASE_URL -c "SELECT 
  (SELECT COUNT(*) FROM users) as users,
  (SELECT COUNT(*) FROM orders) as orders,
  (SELECT COUNT(*) FROM products) as products;"

You should see: Exact counts matching your rowCounts config.

Validate relationships:

-- Check for orphaned orders
SELECT COUNT(*) FROM orders o 
LEFT JOIN users u ON o.user_id = u.id 
WHERE u.id IS NULL;

Expected: 0 rows (all foreign keys valid).

Advanced: Reusable Seeder Class

class AISeedGenerator {
  constructor(apiKey, dbUrl) {
    this.anthropic = new Anthropic({ apiKey });
    this.db = createPool({ connectionString: dbUrl });
  }
  
  async seed(schema, config) {
    const sql = await this.generateSQL(schema, config);
    await this.execute(sql);
  }
  
  async generateSQL(schema, config) {
    // Add retry logic for rate limits
    let attempts = 0;
    while (attempts < 3) {
      try {
        const response = await this.anthropic.messages.create({
          model: 'claude-sonnet-4-20250514',
          max_tokens: 4000,
          messages: [{ 
            role: 'user', 
            content: this.buildPrompt(schema, config) 
          }]
        });
        return this.cleanSQL(response.content[0].text);
      } catch (error) {
        if (error.status === 429) {
          await this.sleep(2000 * (attempts + 1));
          attempts++;
        } else throw error;
      }
    }
  }
  
  cleanSQL(raw) {
    // Remove markdown code blocks if present
    return raw
      .replace(/```sql\n?/g, '')
      .replace(/```\n?/g, '')
      .trim();
  }
  
  async execute(sql) {
    await this.db.query('BEGIN');
    try {
      // Split on semicolons, execute sequentially
      const statements = sql.split(';').filter(s => s.trim());
      for (const stmt of statements) {
        await this.db.query(stmt);
      }
      await this.db.query('COMMIT');
    } catch (error) {
      await this.db.query('ROLLBACK');
      throw error;
    }
  }
  
  buildPrompt(schema, config) {
    return `Generate SQL INSERTs for this schema with ${config.rows} realistic records per table:
${schema}

Requirements:
${config.requirements || '- Realistic data\n- Valid foreign keys'}

Return raw SQL only.`;
  }
  
  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage
const seeder = new AISeedGenerator(
  process.env.ANTHROPIC_API_KEY,
  process.env.DATABASE_URL
);

await seeder.seed(schema, {
  rows: 100,
  requirements: '- Include edge cases\n- Realistic timestamps'
});

What You Learned

AI understands relational schemas and generates valid foreign keys
Explicit table ordering prevents constraint violations
Edge cases can be prompted, not manually coded
Claude API handles complex data generation better than templates

Limitations:

Large schemas (50+ tables) may need chunking
Costs ~$0.01-0.05 per seed run depending on data volume
Generated data isn't deterministic (varies per run)

When NOT to use this:

Production data migrations (use proper ETL tools)
Compliance-sensitive data (use anonymization tools)
Performance testing needing millions of rows (use bulk generators)

Tested with Claude Sonnet 4, PostgreSQL 16.x, Node.js 22.x