Running a cluster

Running Unimeter in production means running more than one node. A single node is fine for development and staging, but for the real thing you want the guarantees that come from replication. This page covers what a production cluster looks like, how to operate it, and what to watch.

Unimeter is the system of record for how much your customers used your product, which means the cost of losing events is real money. A single server can lose its events if the disk fails, if the kernel panics at the wrong moment, or if the VM it runs on is terminated. None of these are rare in practice over a long enough timeline.

A cluster of three or more nodes solves this by writing every event to a quorum of nodes before acknowledging the client. In a three-node cluster, if one node dies, the other two hold the data and keep serving requests. When the dead node comes back, it catches up from the others. At no point does a committed event vanish, even if the timing of the failure is unlucky.

Three is the practical minimum because replication requires a majority to acknowledge. With two nodes a single failure leaves you without a majority and writes stall. Five nodes tolerate two simultaneous failures and seven tolerate three, at the cost of higher replication overhead, but three is the sweet spot for most deployments.
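The majority arithmetic is worth making concrete. The sketch below is plain Go, not part of Unimeter; it just computes the quorum size and failure tolerance that the cluster sizes above imply.

```go
package main

import "fmt"

// quorum returns the number of acknowledgements needed to commit
// a write in a cluster of n nodes: a strict majority.
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many nodes can fail while the cluster
// still has a majority and keeps accepting writes.
func tolerated(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{2, 3, 5, 7} {
		fmt.Printf("nodes=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), tolerated(n))
	}
}
```

Note that two nodes need both to acknowledge every write, which is why a two-node cluster cannot survive any failure.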

Every node in a Unimeter cluster runs the same binary with the same code. There is no leader process separate from worker processes, no coordinator that has to be running for others to work. Each node is responsible for a slice of the accounts and serves writes and reads for that slice directly. When a node receives a write for an account it does not own, it replicates the write to the owner and acknowledges only after the owner confirms.

Account ownership is determined by hashing the account identifier and taking the remainder when divided by the number of partitions. A cluster has 256 partitions by default, and these are distributed across the nodes evenly. With three nodes each owns about 85 partitions. When a node dies, ownership of its partitions transfers to surviving nodes automatically, and clients learn about the change on the next request.
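The mapping from account to partition to node can be sketched as follows. This is an illustration of the scheme described above, not Unimeter's actual code: FNV-1a stands in for whatever hash the server really uses, and the round-robin ownerOf assignment is a simplification of the real partition map.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Default partition count, per the documentation above.
const numPartitions = 256

// partitionFor maps an account identifier to a partition by
// hashing it and taking the remainder. FNV-1a is a stand-in
// for the real hash function.
func partitionFor(accountID string) int {
	h := fnv.New32a()
	h.Write([]byte(accountID))
	return int(h.Sum32() % numPartitions)
}

// ownerOf assigns partitions to nodes round-robin, so each node
// in a three-node cluster owns roughly 85 of the 256 partitions.
func ownerOf(partition, numNodes int) int {
	return partition % numNodes
}

func main() {
	p := partitionFor("acct_1234")
	fmt.Printf("account acct_1234 -> partition %d -> node %d\n",
		p, ownerOf(p, 3))
}
```

The important property is that the mapping is deterministic: every node and every client computes the same partition for the same account without coordinating.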

This design means the Go client never talks to a proxy or a coordinator. It holds a cache of which node owns each partition and sends each request directly to the right node. The cache updates automatically when ownership changes. This is why your application code does not need any retry or reconnection logic; it is all in the client library.

A typical production cluster is three Unimeter nodes running on three separate machines. Each node has its own disk for data, its own network interface, and its own Prometheus scrape target. The nodes are configured with each other’s addresses so they can replicate events. From outside the cluster, your application’s Unimeter clients are given all three addresses as seeds, so they can connect to whichever is reachable.

The binary listens on two TCP ports. The application port, default 7001, speaks the binary protocol that the Go client uses. The replication port, which is the application port plus 1000, is for inter-node traffic. If your nodes are on a private network, only the application port needs to be exposed to the outside world. The replication port stays internal.

An additional HTTP port, default 9090, serves health checks at /health and Prometheus metrics at /metrics. Put this behind your internal monitoring infrastructure. The HTTP port is not for application traffic; your application should always use the binary protocol through the Go client for ingest and query.

Each node is the same unimeter binary started with flags that tell it who it is and who its peers are. You pick a node identifier, a data directory for its local state, the TCP port for the application protocol, and a list of peer addresses for replication.

```shell
unimeter \
  --node-id=0 \
  --port=7001 \
  --data-dir=/var/lib/unimeter \
  --peers=1:node1.internal:8002,2:node2.internal:8003
```

The replication port is conventionally the application port plus 1000, so node 0 listening on 7001 replicates on 8001. Each peer entry takes the form node_id:host:replication_port. Start the same binary on each machine with its own node identifier and data directory, and the cluster forms automatically once the processes can reach each other.
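The peer list format is mechanical enough to parse in a few lines. The sketch below is illustrative, not part of Unimeter; parsePeers and the Peer struct are hypothetical names that just demonstrate the node_id:host:replication_port shape described above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Peer is one node_id:host:replication_port entry from --peers.
type Peer struct {
	NodeID int
	Host   string
	Port   int
}

// parsePeers splits a comma-separated --peers value into entries.
func parsePeers(s string) ([]Peer, error) {
	var peers []Peer
	for _, entry := range strings.Split(s, ",") {
		parts := strings.Split(entry, ":")
		if len(parts) != 3 {
			return nil, fmt.Errorf("malformed peer entry %q", entry)
		}
		id, err := strconv.Atoi(parts[0])
		if err != nil {
			return nil, fmt.Errorf("bad node id in %q: %w", entry, err)
		}
		port, err := strconv.Atoi(parts[2])
		if err != nil {
			return nil, fmt.Errorf("bad port in %q: %w", entry, err)
		}
		peers = append(peers, Peer{NodeID: id, Host: parts[1], Port: port})
	}
	return peers, nil
}

func main() {
	peers, err := parsePeers("1:node1.internal:8002,2:node2.internal:8003")
	if err != nil {
		panic(err)
	}
	for _, p := range peers {
		fmt.Printf("node %d replicates at %s:%d\n", p.NodeID, p.Host, p.Port)
	}
}
```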

Run the binary under a supervisor that restarts it on exit. Systemd, Kubernetes, or whichever orchestrator you normally use is fine. Unimeter has no special requirements beyond a writable data directory and network reachability to its peers.

Every node exposes a Prometheus scrape endpoint and a health check endpoint on its HTTP port, default 9090. Point your existing monitoring infrastructure at each node and the rest is normal Prometheus. The full list of metrics, what they mean, and suggested alerts is on the Metrics and observability page.

A node crash is not a crisis. The cluster continues to serve writes and reads, and clients do not need to be restarted. The surviving nodes take over the dead node’s partitions and keep working. When the crashed node restarts, it rejoins the cluster and catches up from its peers.

A disk failure on one node is similarly contained. The cluster continues serving traffic with the two remaining nodes. When you replace the disk, the new node resyncs its state from its peers. This works because every event is replicated across a majority of the cluster before acknowledgement, so losing one copy never loses the event.

A network partition that splits the cluster is the trickiest case. If the partition leaves a majority on one side, that side keeps working and the minority side is unable to commit new writes. When connectivity is restored the minority side catches up. If the partition is a clean split into two equal groups, neither side has a majority and writes stall until the partition heals. This is a safety feature, because allowing both sides to continue accepting writes would lead to divergent state that cannot be reconciled.

Upgrades are done one node at a time. Stop a node, deploy the new binary, start the node, wait for it to rejoin and catch up, then move to the next. The cluster continues serving traffic during the whole process because the other nodes are still up. There is no requirement to run all nodes at the same version simultaneously; a mixed-version cluster is supported as long as the new version is at most one minor version ahead of the old.

Before you upgrade, check the release notes for any specific migration instructions. Durable state on disk is forward-compatible within a major version, so you should not need to export and re-import data for routine upgrades.

A few operational conveniences are on the roadmap. Node rebalancing, meaning adding a node and having partitions redistribute automatically, is not yet automatic; today it requires a manual partition-map update. Authentication using API keys is not yet enforced by the server, which means anyone who can reach the application port can send events. Per-customer rate limiting is similarly unimplemented. Treat Unimeter the way you would treat an internal Postgres: put it behind your own edge and use network policy to control who can reach it.

If you are curious about how Unimeter achieves its throughput and correctness claims, read How it stays fast for the design rationale.