# Disaster Recovery

## Write-Ahead Log (WAL)
Rivellum uses a write-ahead log with CRC32C checksums for crash recovery. Every state-modifying operation is logged before application.
### WAL Configuration

```toml
[wal]
wal_dir = "/var/lib/rivellum/wal"
wal_fsync_policy = "strict" # strict | balanced | dev_fast
```

| Policy | Behavior | Use Case |
|---|---|---|
| `strict` | fsync after every write | Production (highest durability) |
| `balanced` | Periodic fsync | Production (good durability, better throughput) |
| `dev_fast` | No fsync | Development/testing only |
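The append path implied by the table can be sketched as follows. This is a minimal illustration, not Rivellum's actual implementation: the record framing (length + checksum header) and the `WalWriter` name are assumptions, and Python's `zlib.crc32` stands in for CRC32C (real CRC32C uses the Castagnoli polynomial, available via packages such as `google-crc32c`).

```python
import os
import struct
import zlib


class WalWriter:
    """Sketch of a WAL appender honoring the three fsync policies."""

    def __init__(self, path, policy="strict", batch=16):
        self.f = open(path, "ab")
        self.policy = policy      # strict | balanced | dev_fast
        self.batch = batch        # records between fsyncs for "balanced"
        self.pending = 0

    def append(self, payload: bytes):
        # Assumed frame: 4-byte length, 4-byte checksum, then the payload.
        header = struct.pack("<II", len(payload), zlib.crc32(payload))
        self.f.write(header + payload)
        self.f.flush()
        if self.policy == "strict":
            os.fsync(self.f.fileno())          # durable after every write
        elif self.policy == "balanced":
            self.pending += 1
            if self.pending >= self.batch:     # periodic fsync
                os.fsync(self.f.fileno())
                self.pending = 0
        # dev_fast: no fsync; data is durable only when the OS flushes it

    def close(self):
        os.fsync(self.f.fileno())
        self.f.close()
```

The durability/throughput trade-off in the table falls directly out of how often `os.fsync` runs on this path.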
### Crash Recovery

On startup, the node automatically replays the WAL up to the last fully written record. Records with bad CRC32C checksums are skipped, since they indicate incomplete writes from a crash.
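The replay behavior described above can be sketched like this. The framing is the same assumption as before (4-byte length, 4-byte checksum, payload), with `zlib.crc32` standing in for CRC32C; the real recovery code is not shown here.

```python
import struct
import zlib


def replay(path, apply):
    """Scan WAL records, applying intact ones and skipping corrupt ones."""
    with open(path, "rb") as f:
        data = f.read()
    applied = skipped = 0
    pos = 0
    while pos + 8 <= len(data):
        length, crc = struct.unpack_from("<II", data, pos)
        start, end = pos + 8, pos + 8 + length
        if end > len(data):
            break                      # truncated tail from a crash
        payload = data[start:end]
        if zlib.crc32(payload) == crc:
            apply(payload)             # re-apply the logged operation
            applied += 1
        else:
            skipped += 1               # checksum mismatch: incomplete write
        pos = end
    return applied, skipped
```

A truncated final record (the header promises more bytes than exist) ends the scan, while a complete-but-corrupt record is skipped and the scan continues.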
## Snapshots

### Create a Snapshot

```bash
rivellum-node snapshot create \
  --data-dir /var/lib/rivellum \
  --out /backups/snapshot-$(date +%Y%m%d-%H%M%S)
```
### List Snapshots

```bash
rivellum-node snapshot list --data-dir /var/lib/rivellum
```
### Restore from Snapshot

```bash
# Stop the node first
systemctl stop rivellum-node

# Restore
rivellum-node snapshot restore \
  --from /backups/snapshot-20250101-120000 \
  --data-dir /var/lib/rivellum

# Start the node — it will replay WAL entries after the snapshot
systemctl start rivellum-node
```
## Backup Strategy

### Production Recommendations
- Automated snapshots: Schedule snapshots every 6-12 hours
- Off-site storage: Copy snapshots to remote storage (S3, GCS, etc.)
- WAL retention: Keep WAL files for at least 24 hours after snapshot
- Test restores: Periodically verify backup integrity by restoring to a standby node
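One way to schedule the snapshot cadence above is a cron entry invoking the backup script shown in the next section (the install path here is hypothetical):

```cron
# Run the Rivellum backup script every 6 hours
0 */6 * * * /usr/local/bin/rivellum-backup.sh
```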
### Automated Backup Script

```bash
#!/bin/bash
set -euo pipefail

BACKUP_DIR="/backups/rivellum"
DATA_DIR="/var/lib/rivellum"
RETENTION_DAYS=7

# Ensure the backup directory exists
mkdir -p "$BACKUP_DIR"

# Create snapshot
SNAP_NAME="snapshot-$(date +%Y%m%d-%H%M%S)"
rivellum-node snapshot create --data-dir "$DATA_DIR" --out "$BACKUP_DIR/$SNAP_NAME"

# Clean old snapshots
find "$BACKUP_DIR" -name 'snapshot-*' -mtime +"$RETENTION_DAYS" -exec rm -rf {} +
```
## State Recovery Scenarios
| Scenario | Recovery Method |
|---|---|
| Process crash | Automatic WAL replay on restart |
| Data corruption | Restore from most recent snapshot |
| Hardware failure | Restore snapshot on new hardware, sync from peers |
| Full resync | Delete data directory, restart node — syncs from genesis |
## Per-Lane State
Each lane has its own RocksDB instance and WAL partition. During recovery:
- Lane states are restored independently
- The MetaRoot aggregator recomputes the global root from lane roots
- COW snapshots ensure consistent point-in-time state
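The aggregation step can be illustrated with a deliberately simplified sketch. The actual MetaRoot scheme is not specified here; this assumes, purely for illustration, that the global root is a hash over the per-lane roots taken in sorted lane order.

```python
import hashlib


def meta_root(lane_roots: dict) -> bytes:
    """Hypothetical aggregation: hash lane IDs and roots in sorted order."""
    h = hashlib.sha256()
    for lane_id, root in sorted(lane_roots.items()):
        h.update(lane_id.encode())  # domain-separate lanes by ID
        h.update(root)
    return h.digest()
```

Sorting makes the result independent of the order in which lanes finish recovering, so the global root is deterministic across restarts.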
For monitoring and log configuration, see Logging & Monitoring.