# Disaster Recovery

## Write-Ahead Log (WAL)
Rivellum uses a write-ahead log with CRC32C checksums for crash recovery. Every state-modifying operation is logged before application.
### WAL Configuration

```toml
[wal]
wal_dir = "/var/lib/rivellum/wal"
wal_fsync_policy = "strict" # strict | balanced | dev_fast
```

| Policy | Behavior | Use Case |
|---|---|---|
| `strict` | fsync after every write | Production (highest durability) |
| `balanced` | Periodic fsync | Production (good durability, better throughput) |
| `dev_fast` | No fsync | Development/testing only |
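The append path implied by the table can be sketched as follows. This is a minimal illustration, not Rivellum's actual implementation: the record framing (length + checksum header) and the `WalWriter` name are assumptions, and Python's `zlib.crc32` stands in for CRC32C (real CRC32C uses the Castagnoli polynomial, available via packages such as `google-crc32c`).

```python
import os
import struct
import zlib


class WalWriter:
    """Sketch of a WAL appender honoring the three fsync policies."""

    def __init__(self, path, policy="strict", batch=16):
        self.f = open(path, "ab")
        self.policy = policy      # strict | balanced | dev_fast
        self.batch = batch        # records between fsyncs for "balanced"
        self.pending = 0

    def append(self, payload: bytes):
        # Assumed frame: 4-byte length, 4-byte checksum, then the payload.
        header = struct.pack("<II", len(payload), zlib.crc32(payload))
        self.f.write(header + payload)
        self.f.flush()
        if self.policy == "strict":
            os.fsync(self.f.fileno())          # durable after every write
        elif self.policy == "balanced":
            self.pending += 1
            if self.pending >= self.batch:     # periodic fsync
                os.fsync(self.f.fileno())
                self.pending = 0
        # dev_fast: no fsync; data is durable only when the OS flushes it

    def close(self):
        os.fsync(self.f.fileno())
        self.f.close()
```

The durability/throughput trade-off in the table falls directly out of how often `os.fsync` runs on this path.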
### Crash Recovery

On startup, the node automatically replays the WAL up to the last fully written record. Records with bad CRC32C checksums are skipped, since they indicate incomplete writes from a crash.
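The replay behavior described above can be sketched like this. The framing is the same assumption as before (4-byte length, 4-byte checksum, payload), with `zlib.crc32` standing in for CRC32C; the real recovery code is not shown here.

```python
import struct
import zlib


def replay(path, apply):
    """Scan WAL records, applying intact ones and skipping corrupt ones."""
    with open(path, "rb") as f:
        data = f.read()
    applied = skipped = 0
    pos = 0
    while pos + 8 <= len(data):
        length, crc = struct.unpack_from("<II", data, pos)
        start, end = pos + 8, pos + 8 + length
        if end > len(data):
            break                      # truncated tail from a crash
        payload = data[start:end]
        if zlib.crc32(payload) == crc:
            apply(payload)             # re-apply the logged operation
            applied += 1
        else:
            skipped += 1               # checksum mismatch: incomplete write
        pos = end
    return applied, skipped
```

A truncated final record (the header promises more bytes than exist) ends the scan, while a complete-but-corrupt record is skipped and the scan continues.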
## Snapshots

### Create a Snapshot

```bash
rivellum-node snapshot create \
  --data-dir /var/lib/rivellum \
  --out /backups/snapshot-$(date +%Y%m%d-%H%M%S)
```
### List Snapshots

```bash
rivellum-node snapshot list --data-dir /var/lib/rivellum
```
### Restore from Snapshot

```bash
# Stop the node first
systemctl stop rivellum-node

# Restore
rivellum-node snapshot restore \
  --from /backups/snapshot-20250101-120000 \
  --data-dir /var/lib/rivellum

# Start the node — it will replay WAL entries after the snapshot
systemctl start rivellum-node
```
## Backup Strategy

### Production Recommendations
- Automated snapshots: Schedule snapshots every 6-12 hours
- Off-site storage: Copy snapshots to remote storage (S3, GCS, etc.)
- WAL retention: Keep WAL files for at least 24 hours after snapshot
- Test restores: Periodically verify backup integrity by restoring to a standby node
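One way to schedule the snapshot cadence above is a cron entry invoking the backup script shown in the next section (the install path here is hypothetical):

```cron
# Run the Rivellum backup script every 6 hours
0 */6 * * * /usr/local/bin/rivellum-backup.sh
```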
### Automated Backup Script

```bash
#!/bin/bash
set -euo pipefail

BACKUP_DIR="/backups/rivellum"
DATA_DIR="/var/lib/rivellum"
RETENTION_DAYS=7

# Ensure the backup directory exists
mkdir -p "$BACKUP_DIR"

# Create snapshot
SNAP_NAME="snapshot-$(date +%Y%m%d-%H%M%S)"
rivellum-node snapshot create --data-dir "$DATA_DIR" --out "$BACKUP_DIR/$SNAP_NAME"

# Clean old snapshots
find "$BACKUP_DIR" -name 'snapshot-*' -mtime +"$RETENTION_DAYS" -exec rm -rf {} +
```
## State Recovery Scenarios
| Scenario | Recovery Method |
|---|---|
| Process crash | Automatic WAL replay on restart |
| Data corruption | Restore from most recent snapshot |
| Hardware failure | Restore snapshot on new hardware, sync from peers |
| Full resync | Delete data directory, restart node — syncs from genesis |
## Per-Lane State
Each lane has its own RocksDB instance and WAL partition. During recovery:
- Lane states are restored independently
- The MetaRoot aggregator recomputes the global root from lane roots
- COW snapshots ensure consistent point-in-time state
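The aggregation step can be illustrated with a deliberately simplified sketch. The actual MetaRoot scheme is not specified here; this assumes, purely for illustration, that the global root is a hash over the per-lane roots taken in sorted lane order.

```python
import hashlib


def meta_root(lane_roots: dict) -> bytes:
    """Hypothetical aggregation: hash lane IDs and roots in sorted order."""
    h = hashlib.sha256()
    for lane_id, root in sorted(lane_roots.items()):
        h.update(lane_id.encode())  # domain-separate lanes by ID
        h.update(root)
    return h.digest()
```

Sorting makes the result independent of the order in which lanes finish recovering, so the global root is deterministic across restarts.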
For monitoring and log configuration, see Logging & Monitoring.