Infrastructure Upgrade Plan - November 2025
Created: 2025-11-01
Status: In Progress
Estimated Completion: 7 weeks from start
Overview
This document tracks the systematic upgrade of all infrastructure components across Ansible-, ArgoCD-, and Terraform-managed resources.
Upgrade Principles
- One component at a time
- Test after each change
- Create a separate branch for each component
- Monitor for 24-48 hours before proceeding to the next component
- Full backup before each phase
Phase 1: Foundation & Prerequisites ✅ COMPLETED
Pre-Flight Checklist
- [ ] Full Velero backup created
- [ ] Current state documented
- [ ] Vault health verified
- [ ] Rollback procedures tested
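The backup item in this checklist can be scripted. The sketch below is one possible helper, not the procedure actually used for this plan. `backup_name` and `preflight_backup` are hypothetical names, and the name format simply mirrors the pre-longhorn-upgrade-20251106-134603 backup recorded under Longhorn. It assumes the `velero` CLI is installed and pointed at the cluster.

```shell
# Hypothetical pre-flight helper; function names are illustrative only.

# Build a backup name of the form pre-<component>-upgrade-YYYYMMDD-HHMMSS.
backup_name() {
  printf 'pre-%s-upgrade-%s\n' "$1" "$(date +%Y%m%d-%H%M%S)"
}

# Create the backup, wait for Velero to finish, then print the name used.
preflight_backup() {
  local name
  name="$(backup_name "$1")"
  velero backup create "$name" --wait && echo "$name"
}

# Usage (needs cluster access):
#   preflight_backup longhorn
```

Afterwards, `velero backup describe <backup-name>` shows whether the backup phase is Completed, which is worth confirming before ticking the box.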
1.1 Ansible Collections Update
Status: ⏳ Not Started
Branch: upgrade/ansible-collections
Files: ansible/requirements-ansible.yml
Current: kubernetes.core >=5.0.0
Target: kubernetes.core >=6.2.0
1.2 cert-manager Update
Status: ✅ Completed
Branch: upgrade/cert-manager-v1.19.1
PR: #34
Files: ansible/roles/k3sup/tasks/cert-manager.yml
Current: v1.18.2
Target: v1.19.1
Notes:
- Skip v1.19.0 due to certificate re-issue bug
- Go directly to v1.19.1
- Wait 10 minutes after upgrade
- Verify all certificates: kubectl get certificates -A
- Verify certificate requests: kubectl get certificaterequests -A
Rollback:
# Revert chart_version to v1.18.2 in ansible/roles/k3sup/tasks/cert-manager.yml
ansible-playbook -i inventory/hosts.yml k3s-playbook.yml --tags cert-manager
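The certificate check in the notes above can be narrowed to print only the problem entries. This is a minimal sketch: `unready_certificates` is a hypothetical name, and it assumes the default `kubectl get certificates -A` column order (NAMESPACE, NAME, READY, SECRET, AGE).

```shell
# List certificates whose READY column is anything other than "True".
# Assumes default columns: NAMESPACE NAME READY SECRET AGE.
unready_certificates() {
  kubectl get certificates -A --no-headers |
    awk '$3 != "True" { print $1 "/" $2 }'
}

# Usage (needs cluster access):
#   unready_certificates
```

Empty output after the 10-minute wait suggests every certificate reports Ready; any line printed names a certificate to inspect with `kubectl describe certificate`.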
1.3 External Secrets Operator Update
Status: ✅ Completed
Branch: upgrade/external-secrets-v0.20.4
PR: #36
Files: ansible/roles/k3sup/tasks/external-secrets-operator.yml
Current: 0.19.2
Target: 0.20.4
1.4 MetalLB Update
Status: ✅ Completed
Branch: upgrade/metallb-v0.15.2
PR: #37
Files:
- ansible/roles/k3sup/tasks/metallb.yml
- argocd/cluster-app/templates/metallb.yaml
Current: v0.14.9
Target: v0.15.2
Phase 2: Core Infrastructure (Week 2) 🔄 CURRENT PHASE
2.1 ArgoCD Update
Status: ✅ Completed
Branch: upgrade/argocd-v9.0.5
PR: #38
Current: 8.3.0
Target: 9.0.5
Priority: 🟡 Medium
Notes: Review breaking changes in 9.x release notes
2.2 Prometheus CRDs Update
Status: 🔄 In Progress
Branch: upgrade/prometheus-crds-v79.1.0
Files: ansible/roles/k3sup/tasks/prometheus-crds.yml
Current: 23.0.0
Target: 79.1.0
Priority: 🟡 Medium
Notes: MUST update before kube-prometheus-stack
2.3 Longhorn Update
Status: 🔄 In Progress
Branch: upgrade/longhorn-v1.10.1
PR: (pending - waiting for backup verification)
Files: ansible/roles/k3sup/tasks/longhorn.yml
Current: 1.9.1
Target: 1.10.1
Priority: 🔴 High
Notes:
- Manual CR migration required
- Skip v1.10.0 (critical bug)
- Use hotfixed image if needed
- Allow 30-60 minutes for completion
- ✅ Pre-upgrade Velero backup: pre-longhorn-upgrade-20251106-134603
Warning: ⚠️ Read upgrade guide: https://longhorn.io/docs/1.10.0/deploy/upgrade/
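The plan names two releases to skip (cert-manager v1.19.0 and Longhorn 1.10.0). A small guard like the one below could be run before an upgrade to catch a skipped version in a target by mistake. It is a sketch only: `SKIPPED_VERSIONS` and `is_skipped_version` are hypothetical names, and the list must be kept in step with this document by hand.

```shell
# Releases this plan says to skip, as "component:version" pairs.
SKIPPED_VERSIONS="cert-manager:v1.19.0 longhorn:1.10.0"

# Return 0 (true) when the component/version pair is on the skip list.
is_skipped_version() {
  local pair entry
  pair="$1:$2"
  for entry in $SKIPPED_VERSIONS; do
    [ "$entry" = "$pair" ] && return 0
  done
  return 1
}

# Usage:
#   is_skipped_version longhorn 1.10.0 && echo "refusing to deploy 1.10.0" >&2
```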
2.4 Core ArgoCD Apps
Status: ⏳ Not Started
Branch: upgrade/argocd-core-apps
Components: metallb, reloader
Phase 3: Observability Stack (Week 3)
3.1 kube-prometheus-stack
Status: ⏳ Not Started
Branch: upgrade/kube-prometheus-stack-v79.1.0
Current: 77.6.1
Target: 79.1.0
Priority: 🟡 Medium
3.2 Loki
Status: ⏳ Not Started
Branch: upgrade/loki-v6.45.2
Current: 6.40.0
Target: 6.45.2
Priority: 🟡 Medium
3.3 Grafana Operator
Status: ⏳ Not Started
Branch: upgrade/grafana-operator-v5.20.0
Current: v5.19.4
Target: v5.20.0
Priority: 🟢 Low
3.4 Grafana Alloy
Status: ⏳ Not Started
Branch: upgrade/grafana-alloy-v1.2.1
Current: 1.2.1
Target: 1.2.1 (verify whether a newer release exists)
Priority: 🟢 Low
Phase 4: Core Services (Week 4)
4.1 Vault
Status: ⏳ Not Started
Branch: upgrade/vault-v0.31.0
Current: 0.30.0
Target: 0.31.0
Priority: 🟡 Medium
Warning: ⚠️ CRITICAL - Back up Vault data first
4.2 CNPG
Status: ⏳ Not Started
Branch: upgrade/cnpg-v0.26.1
Current: 0.26.0
Target: 0.26.1
Priority: 🟢 Low
4.3 Valkey
Status: 🚫 Blocked - Decision Needed
Branch: TBD
Current: 3.0.31
Target: 4.1.3 OR migrate to alternative
Priority: 🟡 Medium
Decision Required:
- ⚠️ Bitnami requires a commercial subscription after Aug 28, 2025
- Option A: Accept subscription
- Option B: Migrate to official Valkey images or alternative
Phase 5: Auth & Applications (Week 5)
5.1 Authentik
Status: ⏳ Not Started
Branch: upgrade/authentik-v2025.10.0
Current: 2025.6.1
Target: 2025.10.0
Priority: 🟡 Medium
5.2 Argo Workflows
Status: ⏳ Not Started
Branch: upgrade/argo-workflows-v0.45.27
Current: 0.45.24
Target: 0.45.27
Priority: 🟢 Low
5.3 Windmill
Status: ⏳ Not Started
Branch: upgrade/windmill-v2.0.495
Current: 2.0.488
Target: 2.0.495
Priority: 🟢 Low
Warning: Do NOT upgrade to 3.x without extensive testing
Phase 6: Networking & Backup (Week 6)
6.1 Traefik
Status: ⏳ Not Started
Branch: upgrade/traefik-v37.2.0
Current: 35.4.0
Target: 37.2.0
Priority: 🟡 Medium
Warning: ⚠️ Major version jump - review 36.x and 37.x release notes
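Several targets in this plan cross a major chart version (Traefik 35 to 37, ArgoCD 8 to 9, Velero 10 to 11). The sketch below shows one way to flag such jumps from the Current/Target pairs so that the intermediate release notes get reviewed. `major_of` and `major_jump` are hypothetical helper names, and the logic assumes plain dotted versions with an optional leading "v".

```shell
# Print the major component of a version such as 37.2.0 or v1.19.1.
major_of() {
  local v
  v="${1#v}"          # drop an optional leading "v"
  echo "${v%%.*}"     # keep everything before the first dot
}

# Print how many major versions separate current ($1) and target ($2).
major_jump() {
  echo $(( $(major_of "$2") - $(major_of "$1") ))
}

# Usage:
#   major_jump 35.4.0 37.2.0   # Traefik: crosses two majors (36.x and 37.x)
```

A result greater than 1 means at least one intermediate major release has its own notes to read, as the Traefik warning points out.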
6.2 Velero
Status: ⏳ Not Started
Branch: upgrade/velero-v11.1.1
Current: 10.1.2
Target: 11.1.1
Priority: 🟡 Medium
Component Status Legend
- ⏳ Not Started
- 🔄 In Progress
- ✅ Completed
- ⏸️ Paused
- ❌ Failed/Rolled Back
- 🚫 Blocked
Priority Legend
- 🔴 High - Security/Stability critical
- 🟡 Medium - Feature updates, bug fixes
- 🟢 Low - Minor updates, nice-to-have
Testing Checklist Template
Run after EACH component update:
# 1. Pod health
kubectl get pods -A | grep -v Running | grep -v Completed
# 2. Application sync status (if ArgoCD-managed)
kubectl get applications -n argocd
# 3. Certificate validity (after cert-manager)
kubectl get certificates -A | grep False
# 4. External secrets sync (after ESO)
kubectl get externalsecrets -A | grep SecretSyncedError
# 5. Load balancer IPs (after MetalLB)
kubectl get svc -A | grep LoadBalancer
# 6. Prometheus targets (after monitoring updates)
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Visit http://localhost:9090/targets
# 7. Vault health (after Vault update)
kubectl exec -n vault vault-0 -- vault status
# 8. Storage provisioning (after Longhorn)
kubectl get pvc -A | grep -v Bound
# 9. Component-specific logs
kubectl logs -n <namespace> <pod> | grep -i error
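One gap in the pod health check above: filtering on the word Running lets through pods that are Running but not fully ready (for example 0/1). A possible stricter variant is sketched below. `unhealthy_pods` is a hypothetical name, and the parsing assumes the default `kubectl get pods -A` columns (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print pods that are neither Completed nor Running with all containers ready.
# Assumes default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  kubectl get pods -A --no-headers | awk '
    $4 == "Completed" { next }          # finished jobs are fine
    {
      split($3, ready, "/")             # READY looks like "1/2"
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2, $3, $4
    }'
}

# Usage (needs cluster access):
#   unhealthy_pods
```

Empty output means every pod is either Completed or Running with all containers ready.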
Rollback Procedures
Ansible-Managed Components
# 1. Revert chart_version in ansible/roles/k3sup/tasks/<component>.yml
# 2. Run playbook
ansible-playbook -i inventory/hosts.yml k3s-playbook.yml --tags <component>
ArgoCD-Managed Components
# Option 1: Git revert
git revert <commit-hash>
git push
# Option 2: ArgoCD rollback
# Take the ID from `argocd app history <app-name>`; automated sync must be disabled on the app first
argocd app rollback <app-name> <history-id>
Nuclear Option
# Restore from Velero backup
velero restore create --from-backup <backup-name>
Success Criteria
Each component upgrade is considered successful when:
- ✅ Component deployed successfully
- ✅ All pods are Running/Completed
- ✅ Component-specific health checks pass
- ✅ Dependent services remain healthy
- ✅ No errors in component logs for 24 hours
- ✅ Integration tests pass (where applicable)
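The 24-hour log criterion can be checked with a time-bounded log query. The following is a rough sketch rather than an agreed check: `log_error_count` is a hypothetical name, and matching the word "error" is a crude heuristic that will count benign lines too.

```shell
# Count log lines containing "error" (case-insensitive) from the last 24 hours.
# $1 = namespace, $2 = pod name.
log_error_count() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps that quiet.
  kubectl logs -n "$1" "$2" --since=24h | grep -ci error || true
}

# Usage (needs cluster access):
#   log_error_count vault vault-0
```

Because a failed `kubectl logs` call also yields a count of 0, a zero here should be read together with the pod health check rather than on its own.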
Notes & Decisions
2025-11-01
- Initial plan created
- Started with cert-manager as first high-priority update
- Branching strategy: One branch per component
- Will monitor each component for 24-48 hours before proceeding
Valkey Decision Pending
- Need to decide on Bitnami subscription vs migration by Dec 2025
- Research alternatives if migrating
Timeline
| Week | Phase | Components |
|---|---|---|
| 1 | Foundation | Ansible collections, cert-manager, ESO, MetalLB |
| 2 | Core Infrastructure | ArgoCD, Prometheus CRDs, Longhorn, core ArgoCD apps |
| 3 | Observability | kube-prometheus-stack, Loki, Grafana Operator, Grafana Alloy |
| 4 | Core Services | Vault, CNPG, Valkey decision |
| 5 | Auth & Apps | Authentik, Argo Workflows, Windmill |
| 6 | Network & Backup | Traefik, Velero |
| 7 | Buffer | Testing, documentation, cleanup |
Target Completion: Week 7
Current Week: Week 1