
Operational runbook

This page collects the procedures you need when operating Unimeter in production. Each section is a self-contained recipe you can follow under pressure without reading the rest of the documentation first.

Unimeter stores all durable state in the data directory. A backup is a copy of that directory. The files are append-only or atomically replaced, so you can copy them while the server is running without stopping traffic.

The files that matter are the write-ahead log (wal.log), the segment files (*.seg, *.idx, *.props), the checkpoint (checkpoint.bin), the metric registry (metric_registry.bin), the alert log (alert_log.bin), and the partition map (partition_map.bin). Everything else is derived from these.
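Before relying on a backup, it helps to confirm the copy actually contains the fixed-name files listed above. The sketch below checks only those fixed names, since segment files (*.seg, *.idx, *.props) vary by deployment; `check_backup` is an illustrative helper, not part of Unimeter.

```sh
# check_backup DIR: verify a backup directory contains the fixed-name state files.
# Segment files (*.seg, *.idx, *.props) vary per deployment and are not checked here.
check_backup() {
  dir="$1"; missing=0
  for f in wal.log checkpoint.bin metric_registry.bin alert_log.bin partition_map.bin; do
    [ -f "$dir/$f" ] || { echo "missing: $f"; missing=1; }
  done
  [ "$missing" -eq 0 ] && echo "backup looks complete"
  return "$missing"
}

# Example: check_backup /backup/unimeter/20260415
```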

```sh
rsync -a /var/lib/unimeter/ /backup/unimeter/$(date +%Y%m%d)/
```

Run this on each node independently. In a three-node cluster you get three independent backups, any two of which are sufficient to restore the full dataset because every committed event exists on at least two nodes.

Stop the server, replace the data directory with the backup, and start the server again.

```sh
systemctl stop unimeter
rm -rf /var/lib/unimeter/*
cp -a /backup/unimeter/20260415/* /var/lib/unimeter/
systemctl start unimeter
```

On startup Unimeter loads the checkpoint, replays the write-ahead log from the checkpoint offset, and rebuilds the in-memory aggregates. The server is ready to accept traffic once the replay completes, which typically takes a few seconds even for large datasets.
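If you script the restore, you can wait for the replay to finish by polling the health endpoint (port 9090, as used elsewhere in this runbook; adjust to your configuration):

```sh
# After `systemctl start unimeter`, block until the WAL replay finishes
# and the node reports healthy.
until curl -sf http://localhost:9090/health > /dev/null; do
  sleep 1
done
```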

If you are restoring a single node in a running cluster, the node will catch up any events it missed via replication from the other nodes after it rejoins.

Adding a fourth node to a three-node cluster requires three steps.

First, start the new node with the correct --node-id, --peers, and an empty data directory. The node will connect to its peers but will not own any partitions yet.

```sh
unimeter \
  --node-id=3 \
  --port=7001 \
  --data-dir=/var/lib/unimeter \
  --peers=0:node0.internal:8001,1:node1.internal:8002,2:node2.internal:8003
```

Second, trigger a rebalance from any client. The rebalance command tells the cluster to redistribute partitions across the new node count. The server transfers the relevant event data to the new node automatically.
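This runbook does not pin down the exact client invocation, so treat the following as a sketch: `unimeter-client` and its `rebalance` subcommand are hypothetical names standing in for whatever client you deploy.

```sh
# Hypothetical client invocation -- substitute your actual client binary and flags.
unimeter-client --addr=node0.internal:8001 rebalance --nodes=4
```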

Third, update your application’s seed list to include the new node address so that new client connections can discover it.

To remove a node, trigger a rebalance with the reduced node count first. This moves the departing node’s partitions to the remaining nodes. After the rebalance completes, stop the departing node and remove it from the --peers lists of the remaining nodes on their next restart.
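Put together, removal follows the reverse order of addition. The command names below are illustrative (the same hypothetical client as above), and `node3.internal` is a stand-in hostname:

```sh
# 1. Shrink the partition layout first (hypothetical client invocation).
unimeter-client --addr=node0.internal:8001 rebalance --nodes=3
# 2. Once the rebalance completes, stop the departing node.
ssh node3.internal "systemctl stop unimeter"
# 3. Drop the departed node from --peers on the remaining nodes before their next restart.
```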

Failover is automatic. When a leader node becomes unreachable, the replicas detect the absence of heartbeats within two seconds and run a view change to elect a new leader. Clients discover the change through redirect responses and update their partition map cache transparently. No manual intervention is required.

You can verify that failover succeeded by checking the billing_view_changes_total Prometheus counter. A single increment per affected partition is normal. Sustained increments indicate flapping, which usually points to network instability between nodes.
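One way to watch for flapping is to sample the counter twice and compare the delta. The sketch below assumes the counter is exposed in Prometheus text format on the same port as the health endpoint; adjust the scrape URL to your setup.

```sh
# Sum billing_view_changes_total across partitions from a Prometheus text dump.
view_changes() {
  awk '/^billing_view_changes_total/ { sum += $NF } END { print sum + 0 }'
}

# Sample twice, one minute apart; a steadily growing delta suggests flapping.
# before=$(curl -s http://node0:9090/metrics | view_changes)
# sleep 60
# after=$(curl -s http://node0:9090/metrics | view_changes)
# echo "view changes in the last minute: $((after - before))"
```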

Restart nodes one at a time. Wait for each node to rejoin and catch up before restarting the next.

```sh
for node in node0 node1 node2; do
  ssh "$node" "systemctl restart unimeter"
  # Poll for rejoin and catch-up instead of relying on a fixed sleep.
  for attempt in $(seq 1 30); do
    curl -sf "http://$node:9090/health" > /dev/null && break
    sleep 2
  done
  curl -sf "http://$node:9090/health" > /dev/null || echo "WARNING: $node not healthy"
done
```

The PodDisruptionBudget in the Helm chart enforces this automatically in Kubernetes: at least two out of three pods remain available during a rolling update.

When free disk space drops below 256 MB, the server stops accepting new events and returns a backpressure status to clients. Existing data is not affected.
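A quick way to check how close a node is to the limit (the 256 MB threshold comes from the paragraph above; the helper name is ours):

```sh
# check_space PATH: print free megabytes on the volume holding PATH and
# compare against the 256 MB backpressure threshold.
check_space() {
  avail=$(df -Pm "$1" | awk 'NR == 2 { print $4 }')
  if [ "$avail" -lt 256 ]; then
    echo "LOW: ${avail} MB free -- server is (or will soon be) rejecting writes"
  else
    echo "OK: ${avail} MB free"
  fi
}

# Example: check_space /var/lib/unimeter
```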

To recover, free space by reducing --retention-days and restarting the server. The retention sweep runs hourly and deletes segment files older than the configured threshold. Alternatively, move old segment files to another volume manually.

Monitor billing_wal_offset_bytes growth rate to predict when you will run out of space.
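For a back-of-envelope forecast, divide free space by the observed growth rate. Both inputs come from your own monitoring; the numbers below are illustrative.

```sh
# days_left FREE_BYTES BYTES_PER_DAY -> days until the volume fills.
days_left() {
  awk -v free="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", free / rate }'
}

days_left 50000000000 2000000000   # 50 GB free at 2 GB/day -> 25.0 days
```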

Unimeter deletes segment files and evicts aggregate data older than --retention-days (default 90). The sweep runs once per hour. After deletion, a new checkpoint is written so that the freed memory is not reconstructed on restart.

The alert log is not rotated automatically. It grows at roughly 40 bytes per threshold crossing, which is negligible for most deployments. If it grows large, you can safely truncate it while the server is stopped.
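Truncation has to happen while the server is stopped, as noted above; a minimal sequence:

```sh
systemctl stop unimeter
: > /var/lib/unimeter/alert_log.bin   # truncate in place, keeping ownership and permissions
systemctl start unimeter
```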

On startup the server runs integrity checks before accepting traffic. It verifies that the data directory is writable, validates the WAL’s CRC chain, checks the checkpoint’s magic number and checksum, and scans segment files for size consistency. Corrupt checkpoints are fatal; the server will not start. WAL corruption is handled by truncating at the last valid entry and logging a warning.

Sending SIGTERM or SIGINT to the server triggers a graceful shutdown. The server stops accepting new connections, waits up to five seconds for in-flight sync writes to complete, saves a final checkpoint, flushes the WAL, and exits. The log message “graceful shutdown complete” confirms that all data was persisted.
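Assuming the unit logs to the systemd journal, you can confirm a clean shutdown by looking for the message quoted above:

```sh
systemctl stop unimeter
journalctl -u unimeter -n 50 --no-pager | grep -q "graceful shutdown complete" \
  && echo "all data persisted" \
  || echo "WARNING: no clean-shutdown message found"
```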