CoE: Prometheus HA Split-Brain False Alerts

Field       Value
Date        2026-02-08
Severity    Low (false positive alert, no data loss)
Duration    ~2 weeks (alert noise)
Impact      Router-hosts "Container Down" alert firing incorrectly
Resolution  Scale Prometheus to single replica (PR #814)

Summary

The Grafana alert rule "Router Hosts Container Down" was firing as a false positive. The router-hosts container on Firewalla was healthy and running, but the alert evaluated as if the container had disappeared.
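
For reference, the rule keys off cAdvisor's container_last_seen metric. Given the symptom (the alert fired while the container was running, because the series was missing from the queried TSDB), the expression is presumably presence-based; a sketch of the query shape, assumed rather than copied from the Grafana rule:

  # Assumed query shape for the Grafana-managed rule; the threshold and
  # evaluation window live in Grafana and are not reproduced here.
  absent(container_last_seen{name="router-hosts-server", host="firewalla"})

absent() returns a result only when no matching series exists, so a TSDB that never ingested the Firewalla remote_write stream trips it even though the container itself is healthy.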

Root Cause

The cluster ran Prometheus as a 2-replica StatefulSet (pod-0 and pod-1). Each replica maintains an independent TSDB with no cross-replica replication — this is standard Prometheus behavior, not a misconfiguration.
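
The replica count comes from the chart values. A minimal sketch of the relevant kube-prometheus-stack values as they stood before the fix (key paths per the upstream chart; everything else in the values file omitted):

  # Pre-fix sketch of the kube-prometheus-stack values (abridged).
  prometheus:
    podDisruptionBudget:
      enabled: true        # keeps at least one replica up during node drains
    prometheusSpec:
      replicas: 2          # two pods, each writing its own independent TSDB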

The data flow created a split-brain condition:

Firewalla (cAdvisor)
    → alloy-ingest (remote_write)
        → K8s Service (kps-kube-prometheus-stack-prometheus)
            → load-balanced to pod-1

Grafana alert evaluator
    → K8s Service (prometheus-operated)
        → connected to pod-0 (via connection reuse)
            → no Firewalla data here

The remote_write from alloy-ingest landed on pod-1 because the K8s Service load-balanced the initial connection there. Grafana's alert evaluator, using a persistent connection, consistently queried pod-0. Since pod-0 never received the Firewalla remote_write data, the container_last_seen{name="router-hosts-server", host="firewalla"} metric was absent, and the alert fired.
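
The stickiness comes from how both Services hand out backends. The chart's ClusterIP Service picks a pod per TCP connection, and prometheus-operated is the operator's headless governing Service, so clients resolve pod IPs directly and stay pinned to whichever pod they first dialed. A schematic of the headless Service (shape follows the operator's conventions; selector labels are approximate, not dumped from the cluster):

  # Schematic only -- values approximate, not captured from the live cluster.
  apiVersion: v1
  kind: Service
  metadata:
    name: prometheus-operated
  spec:
    clusterIP: None        # headless: DNS returns every replica's pod IP
    selector:
      app.kubernetes.io/name: prometheus   # assumption: labels vary by operator version
    ports:
      - name: web
        port: 9090
        targetPort: web

Neither path re-balances an established connection, which is how alloy-ingest and the Grafana evaluator stayed pinned to different pods for two weeks.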

Why This Wasn't Caught Earlier

  • Scrape-based metrics (node-exporter, kube-state-metrics) appeared on both pods because each replica independently scrapes all targets
  • Only remote_write data (pushed by alloy-ingest) was affected, since it lands on whichever pod the Service routes the connection to
  • The alert was relatively new, so there was no baseline to compare against

Resolution

Scaled Prometheus from 2 replicas to 1 and disabled the PodDisruptionBudget:

  • replicas: 2 → replicas: 1
  • podDisruptionBudget.enabled: true → false

With a single replica, all remote_write data and all alert evaluations hit the same TSDB.
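
Concretely, the fix flips the same two values shown above (post-fix sketch, same assumed file layout):

  # Post-fix sketch (abridged).
  prometheus:
    podDisruptionBudget:
      enabled: false       # a PDB guarding a single replica would block node drains
    prometheusSpec:
      replicas: 1          # one TSDB: remote_write and alert queries now converge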

Why Not Add Thanos or Dual-Write?

For a homelab cluster, the operational complexity of Thanos (sidecar, store gateway, compactor, object storage) or maintaining dual remote_write targets far outweighs the benefit of Prometheus HA. The single replica is backed by Longhorn persistent storage, so data survives pod restarts. The only trade-off is a brief monitoring gap (~2-5 min) during node drains.

Rollout

  • ArgoCD auto-synced after PR merge
  • StatefulSet scale-down removed pod-1 (highest ordinal)
  • Pod-0 was already running; alloy-ingest reconnected within one push cycle (~60s)
  • Alert resolved to OK within minutes
  • Pod-1's PVC remains orphaned on Longhorn (can be cleaned up manually)

Lessons Learned

  1. Prometheus replicas are not HA for remote_write data. Each replica has an independent TSDB. Without Thanos or similar deduplication, remote_write data lands on a single pod nondeterministically.

  2. K8s Service load balancing + persistent connections = sticky routing. Grafana's evaluator and alloy-ingest each hold long-lived connections, so they can end up pinned to different pods indefinitely.

  3. Scrape-based metrics mask the problem. Both pods independently scrape the same targets, so most metrics appear consistent. The divergence only shows up for pushed data.

Follow-Up

  • [ ] Clean up orphaned PVC from pod-1 (prometheus-kps-kube-prometheus-stack-prometheus-db-prometheus-kps-kube-prometheus-stack-prometheus-1)
  • [ ] Consider documenting the single-replica decision as an ADR if the team revisits HA later