CoE: Prometheus HA Split-Brain False Alerts

Field       Value
Date        2026-02-08
Severity    Low (false positive alert, no data loss)
Duration    ~2 weeks (alert noise)
Impact      Router-hosts "Container Down" alert firing incorrectly
Resolution  Scale Prometheus to single replica (PR #814)

Summary

The Grafana alert rule "Router Hosts Container Down" was firing as a false positive. The router-hosts container on Firewalla was healthy and running, but the alert evaluated as if the container had disappeared.
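
For reference, the rule keys off cAdvisor's container_last_seen metric. Given the symptom (the alert fired while the container was running, because the series was missing from the queried TSDB), the expression is presumably presence-based; a sketch of the query shape, assumed rather than copied from the Grafana rule:

  # Assumed query shape for the Grafana-managed rule; the threshold and
  # evaluation window live in Grafana and are not reproduced here.
  absent(container_last_seen{name="router-hosts-server", host="firewalla"})

absent() returns a result only when no matching series exists, so a TSDB that never ingested the Firewalla remote_write stream trips it even though the container itself is healthy.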

Root Cause

The cluster ran Prometheus as a 2-replica StatefulSet (pod-0 and pod-1). Each replica maintains an independent TSDB with no cross-replica replication — this is standard Prometheus behavior, not a misconfiguration.
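
The replica count comes from the chart values. A minimal sketch of the relevant kube-prometheus-stack values as they stood before the fix (key paths per the upstream chart; everything else in the values file omitted):

  # Pre-fix sketch of the kube-prometheus-stack values (abridged).
  prometheus:
    podDisruptionBudget:
      enabled: true        # keeps at least one replica up during node drains
    prometheusSpec:
      replicas: 2          # two pods, each writing its own independent TSDB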

The data flow created a split-brain condition:

Firewalla (cAdvisor)
    → alloy-ingest (remote_write)
        → K8s Service (kps-kube-prometheus-stack-prometheus)
            → load-balanced to pod-1

Grafana alert evaluator
    → K8s Service (prometheus-operated)
        → connected to pod-0 (via connection reuse)
            → no Firewalla data here

The remote_write from alloy-ingest landed on pod-1 because the K8s Service load-balanced the initial connection there. Grafana's alert evaluator, using a persistent connection, consistently queried pod-0. Since pod-0 never received the Firewalla remote_write data, the container_last_seen{name="router-hosts-server", host="firewalla"} metric was absent, and the alert fired.
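
The stickiness comes from how both Services hand out backends. The chart's ClusterIP Service picks a pod per TCP connection, and prometheus-operated is the operator's headless governing Service, so clients resolve pod IPs directly and stay pinned to whichever pod they first dialed. A schematic of the headless Service (shape follows the operator's conventions; selector labels are approximate, not dumped from the cluster):

  # Schematic only -- values approximate, not captured from the live cluster.
  apiVersion: v1
  kind: Service
  metadata:
    name: prometheus-operated
  spec:
    clusterIP: None        # headless: DNS returns every replica's pod IP
    selector:
      app.kubernetes.io/name: prometheus   # assumption: labels vary by operator version
    ports:
      - name: web
        port: 9090
        targetPort: web

Neither path re-balances an established connection, which is how alloy-ingest and the Grafana evaluator stayed pinned to different pods for two weeks.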

Why This Wasn't Caught Earlier

  • Scrape-based metrics (node-exporter, kube-state-metrics) appeared on both pods because each replica independently scrapes all targets
  • Only remote_write data (pushed by alloy-ingest) was affected, since it lands on whichever pod the Service routes the connection to
  • The alert was relatively new, so there was no baseline to compare against

Resolution

Scaled Prometheus from 2 replicas to 1 and disabled the PodDisruptionBudget:

  • replicas: 2 → replicas: 1
  • podDisruptionBudget.enabled: true → false

With a single replica, all remote_write data and all alert evaluations hit the same TSDB.
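
Concretely, the fix flips the same two values shown above (post-fix sketch, same assumed file layout):

  # Post-fix sketch (abridged).
  prometheus:
    podDisruptionBudget:
      enabled: false       # a PDB guarding a single replica would block node drains
    prometheusSpec:
      replicas: 1          # one TSDB: remote_write and alert queries now converge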

Why Not Add Thanos or Dual-Write?

For a homelab cluster, the operational complexity of Thanos (sidecar, store gateway, compactor, object storage) or maintaining dual remote_write targets far outweighs the benefit of Prometheus HA. The single replica is backed by Longhorn persistent storage, so data survives pod restarts. The only trade-off is a brief monitoring gap (~2-5 min) during node drains.

Rollout

  • ArgoCD auto-synced after PR merge
  • StatefulSet scale-down removed pod-1 (highest ordinal)
  • Pod-0 was already running; alloy-ingest reconnected within one push cycle (~60s)
  • Alert resolved to OK within minutes
  • Pod-1's PVC remains orphaned on Longhorn (can be cleaned up manually)

Lessons Learned

  1. Prometheus replicas are not HA for remote_write data. Each replica has an independent TSDB. Without Thanos or similar deduplication, remote_write data lands on a single pod nondeterministically.

  2. K8s Service load balancing + persistent connections = sticky routing. Grafana's evaluator and alloy-ingest each hold long-lived connections, so they can end up pinned to different pods indefinitely.

  3. Scrape-based metrics mask the problem. Both pods independently scrape the same targets, so most metrics appear consistent. The divergence only shows up for pushed data.

Follow-Up

  • [ ] Clean up orphaned PVC from pod-1 (prometheus-kps-kube-prometheus-stack-prometheus-db-prometheus-kps-kube-prometheus-stack-prometheus-1)
  • [ ] Consider documenting the single-replica decision as an ADR if the team revisits HA later