
Metrics and observability

Unimeter is opinionated about observability. Every node exposes a Prometheus scrape endpoint with the counters, gauges, and histograms needed to answer the questions that matter in production. The metric set is small on purpose. Instead of dozens of numbers that nobody reads, you get the dozen or so that actually tell you whether the system is healthy and where it hurts when it is not.

Every Unimeter node listens on a separate HTTP port, default 9090, for monitoring traffic. Two endpoints are exposed. Health checks live at /health and return a small JSON object with a status field and the process uptime in seconds. Prometheus metrics live at /metrics and return the standard text exposition format. Both endpoints are cheap to hit and safe to scrape as often as your collector prefers.
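For illustration, a /health response could look like the following. The text above only specifies a status field and an uptime value in seconds, so the uptime_seconds field name here is an assumption, not a documented contract:

```json
{
  "status": "ok",
  "uptime_seconds": 86400
}
```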

Point your Prometheus server at each Unimeter node’s HTTP port with a normal scrape job. A fifteen-second scrape interval works well for most deployments. Once the metrics are in Prometheus you can visualise them with whichever dashboarding tool your organisation already uses. Grafana is the most common choice, but the metrics are plain and well-documented Prometheus primitives, so any compatible backend works.

The headline number is how many events are being ingested. billing_events_ingested_total is a counter with a mode label that is either async or sync depending on the delivery mode the client requested. Taking the rate of this counter gives you events per second. The async side should be the overwhelming majority of traffic in most deployments; a sudden shift to sync usually means a client library change somewhere worth investigating.
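As a sketch, the per-mode ingest rate can be computed with a query along these lines:

```promql
# Events per second over the last five minutes, split by delivery mode
sum by (mode) (rate(billing_events_ingested_total[5m]))
```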

billing_events_duplicate_total counts events that the deduplication ring rejected because it had seen the same content recently. A small trickle of duplicates is normal and comes from client retries after timeouts. A large jump usually means a client is stuck in a retry loop, and the number here is how you notice.
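One way to watch for retry storms is the duplicate-to-ingested ratio, sketched here; a sustained spike in this fraction is the signal described above:

```promql
# Fraction of arriving events rejected as duplicates; a jump here
# usually means a client is stuck in a retry loop
sum(rate(billing_events_duplicate_total[5m]))
  /
sum(rate(billing_events_ingested_total[5m]))
```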

Three histograms describe how long operations take, all measured in seconds. billing_ingest_async_duration_seconds tracks the time from request arrival to async acknowledgement. This is typically well under a millisecond because async does not wait for disk. billing_ingest_sync_duration_seconds tracks the sync path and includes the fsync and replica acknowledgement, so it is slower by a few milliseconds at p50. billing_wal_sync_duration_seconds isolates the fsync alone, which is useful for diagnosing when slow disk is the culprit behind a latency regression.

Use these with Prometheus histogram_quantile to compute p50, p95, and p99 over any window. Reasonable alert thresholds are p99 under 10 milliseconds for async ingest and under 50 milliseconds for sync ingest. Sustained breaches of either usually mean either disk pressure or network congestion between replicas.
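For example, a p99 over the sync path could be computed like this, assuming the histograms expose the standard Prometheus _bucket series:

```promql
# p99 sync ingest latency over the last five minutes; per the
# thresholds above, sustained values over 0.05 (50 ms) deserve a look
histogram_quantile(
  0.99,
  sum by (le) (rate(billing_ingest_sync_duration_seconds_bucket[5m]))
)
```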

billing_wal_writes_total counts the number of write-ahead log append operations completed. Each ingest batch produces one append, so this is roughly proportional to ingest rate divided by batch size. billing_wal_syncs_total counts fsync completions, which happen for sync deliveries and at periodic flush intervals. billing_wal_offset_bytes is a gauge showing the current write offset in the log, which grows monotonically as events accumulate on disk.

Watching billing_wal_offset_bytes per node tells you how fast you are burning through disk space. Compare the growth rate to available space and you have a rough runway estimate.
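A rough runway sketch, using the gauge's growth rate; comparing the projected offset against your disk capacity gives the estimate described above:

```promql
# Bytes of WAL growth per second, averaged over the last hour
deriv(billing_wal_offset_bytes[1h])

# Projected WAL offset 24 hours from now, extrapolated from the
# last six hours of growth
predict_linear(billing_wal_offset_bytes[6h], 24 * 3600)
```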

Two gauges track active connections. billing_connections_active is the number of open TCP connections on the binary protocol port, which is where application clients land. billing_http_connections_active is the number of open connections on the HTTP port, which is where monitoring tools land. Both should be a small number relative to your client pool size. A steady climb suggests a client is leaking connections and not closing cleanly.

billing_view_changes_total counts the number of times a node became leader of a partition. Every node increments its own counter when it takes over. In a healthy stable cluster this counter barely moves after startup, because leaders do not churn. A sudden increase is how you learn that something is making nodes flap, whether that is network problems, disk stalls, or a crash loop.

For clusters with active-passive replication patterns you would also watch for asymmetry. If one node’s view change counter is stuck at zero while the others keep climbing, that node is probably unreachable from a quorum of peers.
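A per-node view of leadership churn makes that asymmetry easy to spot:

```promql
# View changes per node over the last day; a node stuck at zero
# while its peers climb is likely cut off from a quorum
increase(billing_view_changes_total[24h])
```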

Three metrics describe the live alert push flow. billing_alert_subscribers is a gauge of how many TCP connections currently have push enabled on this node. billing_alerts_recorded_total counts every threshold crossing appended to the alert log, including crossings that replicas record for durability. billing_alerts_pushed_total counts frames queued for live subscribers, which only the leader increments.

In a healthy system the ratio of pushed to recorded should be about one per subscriber on the leader side. If recorded keeps growing while pushed stays flat, your subscribers may have dropped off without reconnecting. If subscribers is zero when you expect it to be non-zero, either no client is subscribed or the nodes are serving connections that never sent ALERT_PUSH_ENABLE.
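A sketch of the pushed-to-recorded ratio check; how close it sits to one per subscriber depends on your topology, so treat the expression as a starting point:

```promql
# If this ratio keeps falling while recorded alerts grow, subscribers
# may have dropped off without reconnecting
sum(rate(billing_alerts_pushed_total[10m]))
  /
sum(rate(billing_alerts_recorded_total[10m]))
```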

These are the full metric names for easy copy-paste into Prometheus queries.

| Name | Type | What it means |
| --- | --- | --- |
| billing_events_ingested_total{mode} | counter | Events successfully persisted. mode is async or sync. |
| billing_events_duplicate_total | counter | Events rejected because the dedup ring already had them. |
| billing_wal_writes_total | counter | WAL append operations completed. |
| billing_wal_syncs_total | counter | WAL fsync operations completed. |
| billing_view_changes_total | counter | Times this node became leader of some partition. |
| billing_connections_active | gauge | Open TCP connections on the binary protocol port. |
| billing_http_connections_active | gauge | Open connections on the HTTP monitoring port. |
| billing_wal_offset_bytes | gauge | Current WAL write offset in bytes. |
| billing_alert_subscribers | gauge | Connections with live alert push enabled. |
| billing_alerts_recorded_total | counter | Threshold crossings durably appended to the alert log. |
| billing_alerts_pushed_total | counter | Alert frames queued for live subscribers (leader only). |
| billing_ingest_async_duration_seconds | histogram | Async ingest latency in seconds. |
| billing_ingest_sync_duration_seconds | histogram | Sync ingest latency in seconds. |
| billing_wal_sync_duration_seconds | histogram | WAL fsync latency in seconds, isolated from the ingest path. |

A minimal alerting setup focuses on the signals that predict outages or user pain. A page-worthy alert fires when sync ingest p99 crosses 100 milliseconds for more than five minutes, because something is wrong with either disk or replication. A warning alert fires when billing_view_changes_total increases by more than three in a fifteen-minute window, because that points to a flapping cluster. A separate warning alert watches billing_connections_active against a static ceiling you choose based on your expected client pool, to catch runaway connection growth.

These are starting points. Tune the thresholds to what your infrastructure actually does on a good day, then alert on deviations from that baseline rather than absolute numbers.
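The alerts above might be expressed as Prometheus alerting rules roughly like this. The rule names and severity labels are placeholders, and the thresholds are the starting points from the text, not tuned defaults:

```yaml
groups:
  - name: unimeter
    rules:
      # Page when sync ingest p99 stays over 100 ms for five minutes
      - alert: UnimeterSyncIngestSlow
        expr: >
          histogram_quantile(0.99,
            sum by (le) (rate(billing_ingest_sync_duration_seconds_bucket[5m])))
          > 0.1
        for: 5m
        labels:
          severity: page

      # Warn when leadership churns more than three times in fifteen minutes
      - alert: UnimeterLeaderFlapping
        expr: increase(billing_view_changes_total[15m]) > 3
        labels:
          severity: warning
```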

A Prometheus job for a three-node Unimeter cluster looks like the following. Adjust the targets to match your deployment.

scrape_configs:
  - job_name: unimeter
    scrape_interval: 15s
    static_configs:
      - targets:
          - node0.internal:9090
          - node1.internal:9090
          - node2.internal:9090

That is the entire integration. The metric names do not clash with any other common exporter, so no relabeling is required.

To see how these metrics connect to the overall design, read How it stays fast. If you are still getting a cluster up, see Running a cluster.