# CoE: Prometheus HA Split-Brain False Alerts
| Field | Value |
|---|---|
| Date | 2026-02-08 |
| Severity | Low (false positive alert, no data loss) |
| Duration | ~2 weeks (alert noise) |
| Impact | Router-hosts "Container Down" alert firing incorrectly |
| Resolution | Scaled Prometheus to a single replica (PR #814) |
## Summary
The Grafana alert rule "Router Hosts Container Down" was firing as a false positive. The router-hosts container on Firewalla was healthy and running, but the alert evaluated as if the container had disappeared.
## Root Cause
The cluster ran Prometheus as a 2-replica StatefulSet (pod-0 and pod-1). Each replica maintains an independent TSDB with no cross-replica replication — this is standard Prometheus behavior, not a misconfiguration.
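As a rough sketch of how this topology arises, assuming the cluster deploys the kube-prometheus-stack Helm chart (the value paths and storage size below are illustrative, not quoted from the actual PR):

```yaml
# Hypothetical values.yaml fragment for kube-prometheus-stack.
# Two replicas produce pod-0 and pod-1, each with its own
# PVC-backed TSDB and no cross-replica replication.
prometheus:
  prometheusSpec:
    replicas: 2
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn  # assumption: Longhorn-backed PVCs, per the Resolution section
          resources:
            requests:
              storage: 50Gi           # illustrative size
```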
The data flow created a split-brain condition:
```
Firewalla (cAdvisor)
  → alloy-ingest (remote_write)
  → K8s Service (kps-kube-prometheus-stack-prometheus)
  → load-balanced to pod-1

Grafana alert evaluator
  → K8s Service (prometheus-operated)
  → connected to pod-0 (via connection reuse)
  → no Firewalla data here
```
The remote_write from alloy-ingest landed on pod-1 because the K8s Service load-balanced the initial connection there. Grafana's alert evaluator, using a persistent connection, consistently queried pod-0. Since pod-0 never received the Firewalla remote_write data, the container_last_seen{name="router-hosts-server", host="firewalla"} metric was absent, and the alert fired.
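The CoE doesn't quote the alert rule itself. It is a Grafana-managed alert, but the failure mode can be sketched with an equivalent Prometheus Operator rule (rule name, hold duration, and the exact `absent()` shape are assumptions):

```yaml
# Hypothetical PrometheusRule sketch. The real rule is evaluated by
# Grafana, but the PromQL behaves the same way: absent() returns a
# series (and the alert fires) on any replica that has never
# received the remote_write data carrying this metric.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: router-hosts-container-down   # illustrative name
spec:
  groups:
    - name: router-hosts
      rules:
        - alert: RouterHostsContainerDown
          expr: absent(container_last_seen{name="router-hosts-server", host="firewalla"})
          for: 5m                     # illustrative hold duration
```

Because pod-0's TSDB genuinely lacked the series, this is not an evaluation bug: the query result was correct for the replica it ran against.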
## Why This Wasn't Caught Earlier
- Scrape-based metrics (node-exporter, kube-state-metrics) appeared on both pods because each replica independently scrapes all targets
- Only `remote_write` data (pushed by alloy-ingest) was affected, since it lands on whichever pod the Service routes the connection to
- The alert was relatively new, so there was no baseline to compare against
## Resolution
Scaled Prometheus from 2 replicas to 1 and disabled the PodDisruptionBudget:
- `replicas: 2` → `replicas: 1`
- `podDisruptionBudget.enabled: true` → `podDisruptionBudget.enabled: false`
With a single replica, all remote_write data and all alert evaluations hit the same TSDB.
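The change from PR #814 can be sketched as a kube-prometheus-stack values fragment (the value paths are assumptions about the chart layout, not a quote from the PR):

```yaml
# Hypothetical values.yaml fragment for the fix. One replica means
# remote_write ingestion and alert evaluation share a single TSDB.
prometheus:
  prometheusSpec:
    replicas: 1      # was 2
  podDisruptionBudget:
    enabled: false   # was true; a minAvailable PDB on a 1-replica set would block node drains
```

Disabling the PDB is what accepts the ~2-5 minute monitoring gap during node drains mentioned below: with one replica and no PDB, the pod can be evicted and rescheduled freely.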
## Why Not Add Thanos or Dual-Write?
For a homelab cluster, the operational complexity of Thanos (sidecar, store gateway, compactor, object storage) or maintaining dual remote_write targets far outweighs the benefit of Prometheus HA. The single replica is backed by Longhorn persistent storage, so data survives pod restarts. The only trade-off is a brief monitoring gap (~2-5 min) during node drains.
## Rollout
- ArgoCD auto-synced after PR merge
- StatefulSet scale-down removed pod-1 (highest ordinal)
- Pod-0 was already running; alloy-ingest reconnected within one push cycle (~60s)
- Alert resolved to OK within minutes
- Pod-1's PVC remains orphaned on Longhorn (can be cleaned up manually)
## Lessons Learned
- **Prometheus replicas are not HA for `remote_write` data.** Each replica has an independent TSDB. Without Thanos or similar deduplication, `remote_write` data lands on a single pod nondeterministically.
- **K8s Service load balancing + persistent connections = sticky routing.** Grafana's evaluator and alloy-ingest each hold long-lived connections, so they can end up pinned to different pods indefinitely.
- **Scrape-based metrics mask the problem.** Both pods independently scrape the same targets, so most metrics appear consistent. The divergence only shows up for pushed data.
## Follow-Up
- [ ] Clean up orphaned PVC from pod-1 (`prometheus-kps-kube-prometheus-stack-prometheus-db-prometheus-kps-kube-prometheus-stack-prometheus-1`)
- [ ] Consider documenting the single-replica decision as an ADR if the team revisits HA later