
Metrics and observability

Unimeter is opinionated about observability. Every node exposes a Prometheus scrape endpoint with the counters, gauges, and histograms needed to answer the questions that matter in production. The metric set is small on purpose. Instead of dozens of numbers that nobody reads, you get the dozen or so that actually tell you whether the system is healthy and where it hurts when it is not.

Every Unimeter node listens on a separate HTTP port, default 9090, for monitoring traffic. Two endpoints are exposed. Health checks live at /health and return a small JSON object with a status field and the process uptime in seconds. Prometheus metrics live at /metrics and return the standard text exposition format. Both endpoints are cheap to hit and safe to scrape as often as your collector prefers.
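For illustration, a /health response could look like the following. The text above only specifies a status field and an uptime value in seconds, so the uptime_seconds field name here is an assumption, not a documented contract:

```json
{
  "status": "ok",
  "uptime_seconds": 86400
}
```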

Point your Prometheus server at each Unimeter node’s HTTP port with a normal scrape job. A fifteen-second scrape interval works well for most deployments. Once the metrics are in Prometheus you can visualise them with whichever dashboarding tool your organisation already uses. Grafana is the most common choice, but the metrics are plain and well-documented Prometheus primitives, so any compatible backend works.

The headline number is how many events are being ingested. billing_events_ingested_total is a counter with a mode label that is either async or sync depending on the delivery mode the client requested. Taking the rate of this counter gives you events per second. The async side should be the overwhelming majority of traffic in most deployments; a sudden shift to sync usually means a client library change somewhere worth investigating.
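As a sketch, the per-mode ingest rate can be computed with a query along these lines:

```promql
# Events per second over the last five minutes, split by delivery mode
sum by (mode) (rate(billing_events_ingested_total[5m]))
```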

billing_events_duplicate_total counts events that the deduplication ring rejected because it had seen the same content recently. A small trickle of duplicates is normal and comes from client retries after timeouts. A large jump usually means a client is stuck in a retry loop, and the number here is how you notice.
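One way to watch for retry storms is the duplicate-to-ingested ratio, sketched here; a sustained spike in this fraction is the signal described above:

```promql
# Fraction of arriving events rejected as duplicates; a jump here
# usually means a client is stuck in a retry loop
sum(rate(billing_events_duplicate_total[5m]))
  /
sum(rate(billing_events_ingested_total[5m]))
```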

Three histograms describe how long operations take, all measured in seconds. billing_ingest_async_duration_seconds tracks the time from request arrival to async acknowledgement. This is typically well under a millisecond because async does not wait for disk. billing_ingest_sync_duration_seconds tracks the sync path and includes the fsync and replica acknowledgement, so it is slower by a few milliseconds at p50. billing_wal_sync_duration_seconds isolates the fsync alone, which is useful for diagnosing when slow disk is the culprit behind a latency regression.

Use these with Prometheus histogram_quantile to compute p50, p95, and p99 over any window. Reasonable alert thresholds are p99 under 10 milliseconds for async ingest and under 50 milliseconds for sync ingest. Sustained breaches of either usually mean either disk pressure or network congestion between replicas.
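For example, a p99 over the sync path could be computed like this, assuming the histograms expose the standard Prometheus _bucket series:

```promql
# p99 sync ingest latency over the last five minutes; per the
# thresholds above, sustained values over 0.05 (50 ms) deserve a look
histogram_quantile(
  0.99,
  sum by (le) (rate(billing_ingest_sync_duration_seconds_bucket[5m]))
)
```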

billing_wal_writes_total counts the number of write-ahead log append operations completed. Each ingest batch produces one append, so this is roughly proportional to ingest rate divided by batch size. billing_wal_syncs_total counts fsync completions, which happen for sync deliveries and at periodic flush intervals. billing_wal_offset_bytes is a gauge showing the current write offset in the log, which grows monotonically as events accumulate on disk.

Watching billing_wal_offset_bytes per node tells you how fast you are burning through disk space. Compare the growth rate to available space and you have a rough runway estimate.
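A rough runway sketch, using the gauge's growth rate; comparing the projected offset against your disk capacity gives the estimate described above:

```promql
# Bytes of WAL growth per second, averaged over the last hour
deriv(billing_wal_offset_bytes[1h])

# Projected WAL offset 24 hours from now, extrapolated from the
# last six hours of growth
predict_linear(billing_wal_offset_bytes[6h], 24 * 3600)
```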

Two gauges track active connections. billing_connections_active is the number of open TCP connections on the binary protocol port, which is where application clients land. billing_http_connections_active is the number of open connections on the HTTP port, which is where monitoring tools land. Both should be a small number relative to your client pool size. A steady climb suggests a client is leaking connections and not closing cleanly.

billing_view_changes_total counts the number of times a node became leader of a partition. Every node increments its own counter when it takes over. In a healthy stable cluster this counter barely moves after startup, because leaders do not churn. A sudden increase is how you learn that something is making nodes flap, whether that is network problems, disk stalls, or a crash loop.

For clusters with active-passive replication patterns you would also watch for asymmetry. If one node’s view change counter is stuck at zero while the others keep climbing, that node is probably unreachable from a quorum of peers.
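A per-node view of leadership churn makes that asymmetry easy to spot:

```promql
# View changes per node over the last day; a node stuck at zero
# while its peers climb is likely cut off from a quorum
increase(billing_view_changes_total[24h])
```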

Three metrics describe the live alert push flow. billing_alert_subscribers is a gauge of how many TCP connections currently have push enabled on this node. billing_alerts_recorded_total counts every threshold crossing appended to the alert log, including crossings that replicas record for durability. billing_alerts_pushed_total counts frames queued for live subscribers, which only the leader increments.

In a healthy system the ratio of pushed to recorded should be about one per subscriber on the leader side. If recorded keeps growing while pushed stays flat, your subscribers may have dropped off without reconnecting. If subscribers is zero when you expect it to be non-zero, either no client is subscribed or the nodes are serving connections that never sent ALERT_PUSH_ENABLE.
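A sketch of the pushed-to-recorded ratio check; how close it sits to one per subscriber depends on your topology, so treat the expression as a starting point:

```promql
# If this ratio keeps falling while recorded alerts grow, subscribers
# may have dropped off without reconnecting
sum(rate(billing_alerts_pushed_total[10m]))
  /
sum(rate(billing_alerts_recorded_total[10m]))
```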

These are the full metric names for easy copy-paste into Prometheus queries.

| Name | Type | What it means |
| --- | --- | --- |
| billing_events_ingested_total{mode} | counter | Events successfully persisted. mode is async or sync. |
| billing_events_duplicate_total | counter | Events rejected because the dedup ring already had them. |
| billing_wal_writes_total | counter | WAL append operations completed. |
| billing_wal_syncs_total | counter | WAL fsync operations completed. |
| billing_view_changes_total | counter | Times this node became leader of some partition. |
| billing_connections_active | gauge | Open TCP connections on the binary protocol port. |
| billing_http_connections_active | gauge | Open connections on the HTTP monitoring port. |
| billing_wal_offset_bytes | gauge | Current WAL write offset in bytes. |
| billing_alert_subscribers | gauge | Connections with live alert push enabled. |
| billing_alerts_recorded_total | counter | Threshold crossings durably appended to the alert log. |
| billing_alerts_pushed_total | counter | Alert frames queued for live subscribers (leader only). |
| billing_ingest_async_duration_seconds | histogram | Async ingest latency in seconds. |
| billing_ingest_sync_duration_seconds | histogram | Sync ingest latency in seconds. |
| billing_wal_sync_duration_seconds | histogram | WAL fsync latency in seconds, isolated from the ingest path. |

A minimal alerting setup focuses on the signals that predict outages or user pain. A page-worthy alert fires when sync ingest p99 crosses 100 milliseconds for more than five minutes, because something is wrong with either disk or replication. A warning alert fires when billing_view_changes_total increases by more than three in a fifteen-minute window, because that points to a flapping cluster. A separate warning alert watches billing_connections_active against a static ceiling you choose based on your expected client pool, to catch runaway connection growth.

These are starting points. Tune the thresholds to what your infrastructure actually does on a good day, then alert on deviations from that baseline rather than absolute numbers.
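The alerts above might be expressed as Prometheus alerting rules roughly like this. The rule names and severity labels are placeholders, and the thresholds are the starting points from the text, not tuned defaults:

```yaml
groups:
  - name: unimeter
    rules:
      # Page when sync ingest p99 stays over 100 ms for five minutes
      - alert: UnimeterSyncIngestSlow
        expr: >
          histogram_quantile(0.99,
            sum by (le) (rate(billing_ingest_sync_duration_seconds_bucket[5m])))
          > 0.1
        for: 5m
        labels:
          severity: page

      # Warn when leadership churns more than three times in fifteen minutes
      - alert: UnimeterLeaderFlapping
        expr: increase(billing_view_changes_total[15m]) > 3
        labels:
          severity: warning
```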

A Prometheus job for a three-node Unimeter cluster looks like the following. Adjust the targets to match your deployment.

scrape_configs:
  - job_name: unimeter
    scrape_interval: 15s
    static_configs:
      - targets:
          - node0.internal:9090
          - node1.internal:9090
          - node2.internal:9090

That is the entire integration. The metric names do not clash with any other common exporter, so no relabeling is required.

To see how these metrics connect to the overall design, read How it stays fast. If you are still getting a cluster up, see Running a cluster.