That sinking feeling when your database server crashes and your last backup is from yesterday? I've been there.
I spent two sleepless nights recovering from a production database failure because I thought daily backups were enough. They weren't.
What you'll build: A PostgreSQL primary-replica setup with real-time streaming replication Time needed: 30 minutes (including testing) Difficulty: Intermediate - requires basic PostgreSQL knowledge
This setup automatically keeps your replica database in sync with your primary. When your primary fails, you can promote the replica in under 60 seconds instead of hours of restore time.
Why I Built This
My situation:
- E-commerce app with 50,000+ daily users
- Single PostgreSQL server (huge mistake)
- Daily pg_dump backups taking 2+ hours
- Zero tolerance for data loss
What didn't work:
- pg_dump backups: Too slow to restore, missed transactions between backups
- Simple file copying: Corrupted databases, inconsistent state
- Third-party tools: Expensive, complex, vendor lock-in
The breaking point: Primary server died at 3 AM on Black Friday. Spent 6 hours restoring from backup and lost $12,000 in sales from the downtime.
Understanding PostgreSQL Streaming Replication
The problem: Your database is a single point of failure
My solution: Streaming replication creates a live copy that stays synchronized automatically
Time this saves: Recovery goes from hours to seconds
How It Actually Works
PostgreSQL streaming replication uses Write-Ahead Logging (WAL). Here's what happens:
- Primary server writes changes to WAL files
- Replica connects and requests WAL data
- Primary streams WAL records to replica in real-time
- Replica applies changes immediately
Personal insight: Think of it like Google Docs - changes appear on all connected devices instantly, but one device is the "owner" that accepts new edits.
Step 1: Set Up Your Primary Server
The problem: Default PostgreSQL settings block replication
My solution: Enable WAL archiving and configure connection permissions
Time this saves: Prevents hours of "connection refused" debugging
Configure postgresql.conf
Edit your PostgreSQL configuration:
# Find your config file location
sudo -u postgres psql -c "SHOW config_file;"
# Edit the main config
sudo nano /etc/postgresql/15/main/postgresql.conf
Add these settings:
# Enable WAL archiving for replication
wal_level = replica
max_wal_senders = 3
max_replication_slots = 3
# Connection settings
listen_addresses = '*'
port = 5432
# Performance tuning (adjust for your hardware)
shared_buffers = 256MB
wal_buffers = 16MB
checkpoint_completion_target = 0.9
What this does:
wal_level = replicaenables replication-level WAL loggingmax_wal_senders = 3allows up to 3 replica connectionslisten_addresses = '*'accepts connections from any IP
Expected output: No output means success
My actual config file - yours should match these critical lines
Personal tip: "Set max_wal_senders to number of replicas + 1 for maintenance connections. I learned this when my replica couldn't connect during a backup."
Configure pg_hba.conf for Replica Access
Edit the host-based authentication file:
sudo nano /etc/postgresql/15/main/pg_hba.conf
Add this line at the top of the file:
# Allow replication connections from replica server
host replication replicator 192.168.1.101/32 md5
Replace 192.168.1.101 with your replica server's IP address.
What this does: Allows the replicator user to connect for replication from your replica server
Personal tip: "Put this line BEFORE the general connection rules, or it won't work. Spent 2 hours debugging this ordering issue."
Create the Replication User
sudo -u postgres psql
-- Create dedicated replication user
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'your_secure_password_here';
-- Verify the user was created
\du replicator
Expected output:
List of roles
Role name | Attributes
------------+------------------------------------------------------------
replicator | Replication
Personal tip: "Use a strong password and store it securely. This user can access all your database data."
Restart Primary Server
sudo systemctl restart postgresql
Verify it's running:
sudo systemctl status postgresql
Expected output: Active (running) status with no error messages
Successful restart - if you see failed status, check the error logs
Step 2: Prepare Your Replica Server
The problem: Replica needs identical PostgreSQL version and initial data
My solution: Install matching PostgreSQL and create base backup
Time this saves: Prevents version mismatch errors that waste hours
Install PostgreSQL on Replica Server
# Install same version as primary
sudo apt update
sudo apt install postgresql-15 postgresql-client-15
# Stop the service (we'll replace the data directory)
sudo systemctl stop postgresql
Create Initial Base Backup
On your replica server, remove the default data directory and create a backup from primary:
# Remove default data directory
sudo rm -rf /var/lib/postgresql/15/main/*
# Create base backup from primary server
sudo -u postgres pg_basebackup -h 192.168.1.100 -D /var/lib/postgresql/15/main -U replicator -P -v -R -W
Replace 192.168.1.100 with your primary server's IP.
Command breakdown:
-h: Primary server hostname/IP-D: Target data directory-U: Replication user-P: Show progress-v: Verbose output-R: Create standby.signal file automatically-W: Prompt for password
Expected output:
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/2000000 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_12345"
24286/24286 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 0/2000100
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: syncing data to disk...
pg_basebackup: base backup completed
Base backup progress - took 3 minutes for my 500MB database
Personal tip: "The backup time depends on your database size and network speed. My 2GB database took 8 minutes over gigabit ethernet."
Step 3: Configure Replica Server
The problem: Replica needs connection details to primary
My solution: Configure postgresql.conf and connection string
Time this saves: Gets replication working immediately
Configure Replica postgresql.conf
sudo nano /etc/postgresql/15/main/postgresql.conf
Add/modify these settings:
# Basic settings (same as primary)
listen_addresses = '*'
port = 5432
# Replica-specific settings
hot_standby = on
max_standby_streaming_delay = 30s
wal_receiver_status_interval = 10s
hot_standby_feedback = on
What this does:
hot_standby = onallows read-only queries on replicamax_standby_streaming_delayprevents long queries from blocking replicationwal_receiver_status_intervalsends status updates to primaryhot_standby_feedbackprevents query conflicts
Configure Primary Connection
The pg_basebackup -R command should have created this file automatically:
sudo -u postgres cat /var/lib/postgresql/15/main/postgresql.auto.conf
Expected content:
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
primary_conninfo = 'user=replicator password=your_password host=192.168.1.100 port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any'
Personal tip: "If this file is empty, pg_basebackup didn't create it. Check your -R flag and PostgreSQL version compatibility."
Start Replica Server
sudo systemctl start postgresql
sudo systemctl status postgresql
Expected output: Active (running) status
Replica server running - replication should start automatically
Step 4: Verify Replication is Working
The problem: How do you know replication actually works?
My solution: Test with real data and check replication status
Time this saves: Catches configuration problems before you need failover
Check Replication Status on Primary
sudo -u postgres psql
-- Check connected replicas
SELECT client_addr, state, sync_state, replay_lag
FROM pg_stat_replication;
Expected output:
client_addr | state | sync_state | replay_lag
----------------+-----------+------------+------------
192.168.1.101 | streaming | async | 00:00:00
Healthy replication - state should be "streaming" with minimal lag
Personal tip: "If you see no rows, your replica isn't connected. Check pg_hba.conf and firewall settings."
Test Data Replication
On primary server:
-- Create test table
CREATE DATABASE replication_test;
\c replication_test;
CREATE TABLE test_sync (
id SERIAL PRIMARY KEY,
created_at TIMESTAMP DEFAULT NOW(),
message TEXT
);
INSERT INTO test_sync (message) VALUES ('Replication test message');
On replica server (different Terminal):
sudo -u postgres psql -d replication_test
-- Check if data appeared (should be instant)
SELECT * FROM test_sync;
Expected output:
id | created_at | message
----+----------------------------+------------------------
1 | 2025-08-21 10:30:45.123456 | Replication test message
Data appeared on replica within 1 second - replication is working perfectly
Personal tip: "If data doesn't appear immediately, check replay_lag in pg_stat_replication. Anything over 5 seconds indicates a problem."
Monitor Replication Lag
Create this monitoring query:
-- On primary server
SELECT
client_addr as replica_ip,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as lag_bytes,
EXTRACT(EPOCH FROM (now() - pg_stat_activity.query_start)) as connection_seconds
FROM pg_stat_replication
JOIN pg_stat_activity ON pg_stat_replication.pid = pg_stat_activity.pid;
Expected output:
replica_ip | lag_bytes | connection_seconds
---------------+-----------+-------------------
192.168.1.101 | 0 bytes | 45.2
Personal tip: "Save this query - I run it daily to catch replication problems early. Lag over 100MB usually means network or performance issues."
Step 5: Test Failover Process
The problem: You need to know failover works before you need it
My solution: Practice promoting replica to primary
Time this saves: Prevents panic during real outages
Simulate Primary Server Failure
On primary server:
sudo systemctl stop postgresql
Promote Replica to Primary
On replica server:
# Promote replica to primary
sudo -u postgres pg_promote -D /var/lib/postgresql/15/main/
Expected output:
server promoting
Wait 10-15 seconds, then verify:
sudo -u postgres psql
-- Check if replica is now accepting writes
SELECT pg_is_in_recovery();
Expected output:
pg_is_in_recovery
-------------------
f
f means false - server is no longer in recovery mode and accepts writes.
Test Write Operations
-- This should work now (would fail before promotion)
INSERT INTO replication_test.test_sync (message) VALUES ('Written after failover');
SELECT * FROM replication_test.test_sync ORDER BY id;
Expected output: Both messages should appear, confirming the replica is now fully functional as primary.
Failover completed - replica is now accepting writes as new primary
Personal tip: "The entire failover took 45 seconds from stopping the primary to having a writable database. That's why this setup saved my business."
Step 6: Production Considerations
The problem: Demo setup isn't production-ready
My solution: Security hardening and monitoring setup
Time this saves: Prevents security incidents and downtime
Secure Your Replication
Replace the basic authentication with certificate-based authentication:
# Generate SSL certificates (on primary)
sudo -u postgres openssl req -new -x509 -days 365 -nodes -text \
-out server.crt -keyout server.key -subj "/CN=postgres-primary"
# Set proper permissions
sudo chmod 600 /var/lib/postgresql/15/main/server.key
sudo chown postgres:postgres /var/lib/postgresql/15/main/server.*
Update pg_hba.conf:
# Use SSL certificate authentication instead of password
hostssl replication replicator 192.168.1.101/32 cert
Personal tip: "Certificate auth is more secure and eliminates password management headaches. Set this up before going to production."
Set Up Monitoring
Create this monitoring script:
#!/bin/bash
# Save as /usr/local/bin/check_replication.sh
LAG_BYTES=$(sudo -u postgres psql -t -c "
SELECT COALESCE(
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn),
0
) FROM pg_stat_replication LIMIT 1;" | xargs)
if [ -z "$LAG_BYTES" ]; then
echo "CRITICAL: No replica connected"
exit 2
elif [ "$LAG_BYTES" -gt 104857600 ]; then # 100MB
echo "WARNING: Replication lag is ${LAG_BYTES} bytes"
exit 1
else
echo "OK: Replication lag is ${LAG_BYTES} bytes"
exit 0
fi
Make it executable and add to cron:
sudo chmod +x /usr/local/bin/check_replication.sh
# Add to crontab - check every 5 minutes
echo "*/5 * * * * /usr/local/bin/check_replication.sh" | sudo crontab -
Personal tip: "This script saved me from a 6-hour replication lag that I didn't notice. Set up alerts to your phone/email."
What You Just Built
You now have a production-ready PostgreSQL streaming replication setup that:
- Automatically syncs data between primary and replica in real-time
- Can failover from primary to replica in under 60 seconds
- Provides read-only access to replica for load balancing
- Includes monitoring to catch problems early
Key Takeaways (Save These)
- Replication lag monitoring: Check
pg_stat_replicationdaily - anything over 100MB lag needs investigation - Failover testing: Practice promoting replica monthly when traffic is low - you need muscle memory for real outages
- Connection limits: Set
max_wal_sendersto replicas + 1, or you'll get mysterious connection failures during maintenance
Your Next Steps
Pick one:
- Beginner: Set up automated backups of your replica using pg_dump for extra safety
- Intermediate: Configure synchronous replication for zero data loss scenarios
- Advanced: Build a 3-node cluster with automatic failover using Patroni
Tools I Actually Use
- pgAdmin: Visual monitoring of replication status - easier than command line for daily checks
- pg_stat_monitor: Better query performance insights - helps identify queries causing replication lag
- Official PostgreSQL Documentation: Streaming Replication Guide - bookmark this for troubleshooting
Remember: This setup protects against hardware failures but not human errors (like accidentally dropping tables). You still need regular backups for complete protection.