Longhorn Replica Migration Plan

Overview

This document outlines the plan to increase Longhorn storage replication from 2 to 3 replicas for high availability and migrate strict-local volumes to replicated storage.

Current State

  • Storage Class: longhorn-retain configured with numberOfReplicas: "2"
  • Active Volumes: 32 total volumes
      • 24 volumes with 2 replicas (now upgraded to 3)
      • 1 volume with strict-local (1 replica): teleport/teleport-db-1
      • 7 detached/orphaned volumes with strict-local (can be deleted)

Changes Applied

1. Storage Class Update

  • Updated argocd/app-configs/shared-resources/longhorn-storage-classes.yaml
  • Changed numberOfReplicas from "2" to "3"
  • New PVCs will automatically get 3 replicas
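To spot-check that the change is live in the cluster (assuming the class keeps the name longhorn-retain shown above):

# Should print 3 once the updated StorageClass has synced
kubectl --context fzymgc-house get storageclass longhorn-retain \
  -o jsonpath='{.parameters.numberOfReplicas}'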

2. Existing Volume Updates

  • Successfully updated 24 volumes from 2 to 3 replicas
  • Longhorn is automatically rebuilding the third replica for each volume
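For reference, a minimal sketch of how the bulk update can be performed, assuming every attached 2-replica volume should be raised to 3:

# Raise every attached 2-replica Longhorn volume to 3 replicas
for vol in $(kubectl --context fzymgc-house get volumes.longhorn.io -n longhorn-system -o json | \
             jq -r '.items[] | select(.status.state == "attached" and .spec.numberOfReplicas == 2) | .metadata.name'); do
  kubectl --context fzymgc-house patch volume.longhorn.io "$vol" \
    -n longhorn-system --type='json' \
    -p='[{"op": "replace", "path": "/spec/numberOfReplicas", "value": 3}]'
done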

Strict-Local Volume Migration

Active Volume: teleport/teleport-db-1

Current Configuration:

  • PVC: teleport/teleport-db-1
  • Volume: pvc-aa2d6a02-4fdd-4262-8b0d-d8fa79010f46
  • StorageClass: postgres-storage
  • Data Locality: strict-local (manually set)
  • Replicas: 1

Source:

  • Managed by CloudNativePG Cluster: teleport/teleport-db
  • PostgreSQL database for Teleport
  • Single-instance database (instances: 1)

Migration Strategy:

The Teleport PostgreSQL volume was manually configured with strict-local data locality, overriding the StorageClass default of best-effort. Two options were considered for migrating it to 3 replicas.

Option A: Direct Volume Update (Minimal Downtime)

  1. Patch the volume to change dataLocality from strict-local to best-effort
  2. Patch the volume to set numberOfReplicas to 3
  3. Longhorn will automatically create the additional replicas
  4. Minimal downtime (the volume remains attached throughout)

# Step 1: Change data locality
kubectl --context fzymgc-house patch volume.longhorn.io pvc-aa2d6a02-4fdd-4262-8b0d-d8fa79010f46 \
  -n longhorn-system --type='json' \
  -p='[{"op": "replace", "path": "/spec/dataLocality", "value": "best-effort"}]'

# Step 2: Increase replicas
kubectl --context fzymgc-house patch volume.longhorn.io pvc-aa2d6a02-4fdd-4262-8b0d-d8fa79010f46 \
  -n longhorn-system --type='json' \
  -p='[{"op": "replace", "path": "/spec/numberOfReplicas", "value": 3}]'

# Step 3: Monitor replication progress
kubectl --context fzymgc-house get volume.longhorn.io pvc-aa2d6a02-4fdd-4262-8b0d-d8fa79010f46 \
  -n longhorn-system -o jsonpath='{.status.robustness}'
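The patch returns immediately; the third replica is rebuilt in the background. One way to wait for the rebuild to complete is to poll robustness until it reports healthy:

# Step 4 (optional): block until the volume reports healthy again
until [ "$(kubectl --context fzymgc-house get volume.longhorn.io pvc-aa2d6a02-4fdd-4262-8b0d-d8fa79010f46 \
    -n longhorn-system -o jsonpath='{.status.robustness}')" = "healthy" ]; do
  sleep 10
done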

Option B: Recreate Volume (Higher Risk)

  1. Create a backup of the PostgreSQL database (see the sketch below)
  2. Delete the PVC and allow CloudNativePG to recreate it
  3. Restore from the backup
  4. Higher downtime and complexity
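If Option B is ever needed, the backup can be driven through CloudNativePG. A minimal sketch, assuming the cnpg kubectl plugin is installed and the teleport-db Cluster already has a backup destination configured:

# Take an on-demand backup of the CNPG cluster before touching the PVC
kubectl --context fzymgc-house cnpg backup teleport-db -n teleport

# Watch the resulting Backup resource until it reports completed
kubectl --context fzymgc-house get backups.postgresql.cnpg.io -n teleport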

Recommendation: Use Option A - direct volume update with minimal downtime.

Orphaned Volumes (Cleanup)

These 7 detached volumes are from deleted PVCs and can be safely removed:

  • pvc-2c1f9d32-12d6-4e69-b5d2-f01c1591ccc6
  • pvc-3429f8d6-4d27-44ca-a7c2-35d33862025b
  • pvc-41babba2-1a4f-469a-bf7e-f4a2-1b92152
  • pvc-4aaaf630-036c-4914-ad0c-d9d83d567cea
  • pvc-b143c9bc-4cfe-41cc-bf2c-4dda0d8634a8
  • pvc-db8b4b70-1fb7-43c3-a9ee-287bcbde8a73
  • pvc-ee7385a2-f3e7-417d-b15a-38f6ce7dc81f

Cleanup Command:

for vol in pvc-2c1f9d32-12d6-4e69-b5d2-f01c1591ccc6 \
           pvc-3429f8d6-4d27-44ca-a7c2-35d33862025b \
           pvc-41babba2-1a4f-469a-bf7e-f4a2-1b92152 \
           pvc-4aaaf630-036c-4914-ad0c-d9d83d567cea \
           pvc-b143c9bc-4cfe-41cc-bf2c-4dda0d8634a8 \
           pvc-db8b4b70-1fb7-43c3-a9ee-287bcbde8a73 \
           pvc-ee7385a2-f3e7-417d-b15a-38f6ce7dc81f; do
  kubectl --context fzymgc-house delete volume.longhorn.io $vol -n longhorn-system
done
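Before deleting, a quick pre-flight check that each of these volumes is still detached (and a look at the PVC Longhorn last associated with them) may be worthwhile:

# Pre-flight: list detached volumes and the PVC each was last bound to
kubectl --context fzymgc-house get volumes.longhorn.io -n longhorn-system -o json | \
  jq -r '.items[] | select(.status.state == "detached") | "\(.metadata.name): pvc=\(.status.kubernetesStatus.pvcName // "none")"'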

Storage Class Updates Required

Update argocd/app-configs/cnpg/storageclass-postgres.yaml to match the new replica count:

parameters:
  numberOfReplicas: "3"  # Changed from "2"
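Note that Kubernetes treats StorageClass parameters as immutable, so the live postgres-storage object cannot be patched in place; it has to be deleted and recreated (for example via ArgoCD's Replace sync option). Existing PVs and PVCs are unaffected by the swap. A manual sketch:

# StorageClass parameters are immutable; delete the object and let GitOps recreate it with the new value
kubectl --context fzymgc-house delete storageclass postgres-storage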

Verification Steps

After migration:

  1. Verify storage class replica counts:

    kubectl --context fzymgc-house get storageclass -o json | \
      jq -r '.items[] | select(.provisioner == "driver.longhorn.io") | "\(.metadata.name): replicas=\(.parameters.numberOfReplicas)"'
    

  2. Verify all active volumes have 3 replicas:

    kubectl --context fzymgc-house get volumes.longhorn.io -n longhorn-system -o json | \
      jq -r '.items[] | select(.status.state == "attached") | "\(.metadata.name): replicas=\(.spec.numberOfReplicas), locality=\(.spec.dataLocality)"'
    

  3. Check volume health:

    kubectl --context fzymgc-house get volumes.longhorn.io -n longhorn-system -o json | \
      jq -r '.items[] | select(.status.state == "attached") | "\(.metadata.name): \(.status.robustness)"'
    

Rollback Plan

If issues occur:

  1. Revert storage class changes:

    git checkout main -- argocd/app-configs/shared-resources/longhorn-storage-classes.yaml
    git checkout main -- argocd/app-configs/cnpg/storageclass-postgres.yaml
    

  2. Reduce volume replicas:

    # This will NOT delete existing data, just reduce replica count
    kubectl --context fzymgc-house patch volume.longhorn.io <volume-name> \
      -n longhorn-system --type='json' \
      -p='[{"op": "replace", "path": "/spec/numberOfReplicas", "value": 2}]'
    

Timeline

  1. Immediate: Storage class updates committed and merged (this PR)
  2. Post-merge: Update postgres-storage StorageClass via GitOps
  3. Scheduled maintenance window: Migrate teleport-db-1 volume (Option A)
  4. After verification: Cleanup orphaned volumes

Risks and Mitigation

Risk: Volume replication consumes additional disk space. Mitigation: Monitor disk usage; the current cluster has sufficient capacity.
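Per-node disk headroom can be checked from the Longhorn node CRs; a rough sketch (field names assume the Longhorn Node CRD's status.diskStatus map):

# Show available vs. maximum storage for each disk on every Longhorn node
kubectl --context fzymgc-house get nodes.longhorn.io -n longhorn-system -o json | \
  jq -r '.items[] | .metadata.name as $n | .status.diskStatus | to_entries[] | "\($n)/\(.key): available=\(.value.storageAvailable), max=\(.value.storageMaximum)"'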

Risk: Replica creation may impact I/O performance during the sync. Mitigation: Longhorn throttles rebuilds, so minimal impact is expected.

Risk: Teleport database downtime during the migration. Mitigation: Option A keeps downtime minimal; Option B provides a backup/restore path.

Success Criteria

  • All storage classes configured with 3 replicas
  • All active volumes have 3 replicas with best-effort data locality
  • No orphaned/detached volumes remaining
  • Volume health status shows healthy for all volumes
  • No service interruptions for applications