NATS Operations¶
Operational guide for NATS messaging infrastructure in the fzymgc-house cluster.
Quick Reference¶
| Property | Value |
|---|---|
| Namespace | nats |
| Service | nats.nats.svc.cluster.local:4222 |
| Helm Chart | nats/nats v2.12.3 |
| Storage | longhorn-encrypted (10Gi/node) |
| Vault Path | secret/fzymgc-house/cluster/nats |
| Dashboard | Grafana → NATS folder |
Client Access¶
Applications must label their namespace before they can connect to NATS.
A NetworkPolicy enforces this: connections from namespaces without the label are blocked.
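The label itself is not reproduced in this section. As a minimal sketch, assuming a hypothetical label key and value (substitute the pair that the NATS NetworkPolicy actually selects on), a small helper can print the labeling command for review before anyone runs it:

```shell
# Print the kubectl command that labels a namespace for NATS access.
# The label key and value are supplied by the caller; this guide does not
# state the real pair, so check the NetworkPolicy in the nats namespace first.
nats_label_cmd() {
  ns="$1"
  key="$2"
  value="$3"
  printf 'kubectl label namespace %s %s=%s --overwrite\n' "$ns" "$key" "$value"
}

# Example with hypothetical values:
#   nats_label_cmd my-app example.invalid/nats-access true
```

Printing the command rather than running it keeps the helper safe to use while the real label is being confirmed.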
Architecture¶
┌─────────────────────────────────────────────────────────────┐
│ NATS 3-Node Cluster │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ nats-0 │◄──►│ nats-1 │◄──►│ nats-2 │ (Raft) │
│ │JetStream│ │JetStream│ │JetStream│ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ PVC (10Gi) PVC (10Gi) PVC (10Gi) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐
│ SERVICES acct │ │ IOT acct │
│ (cluster apps) │ │ (Home Assistant)│
└─────────────────┘ └─────────────────┘
MQTT Bridge¶
Mosquitto runs in the mosquitto namespace and bridges to NATS via the MQTT listener on port 1883.
| Property | Value |
|---|---|
| Bridge address | nats.nats.svc:1883 |
| TLS | Required (handshake_first: true on NATS) |
| CA bundle | fzymgc-ica1-ca ConfigMap (fullchain.crt) |
| Auth | Bearer JWT from IOT account |
| Vault path | secret/fzymgc-house/cluster/mosquitto |
| Topic mapping | mqtt/<topic> (Mosquitto) ↔ <topic> (NATS) |
Authentication¶
The NATS MQTT listener runs in operator mode. MQTT clients must use a bearer user JWT as the password, with any non-empty username. Generate the bearer JWT with:
nsc add user --account IOT --name mqtt-bridge --bearer --conn-type MQTT
nsc generate creds --account IOT --name mqtt-bridge | grep -A1 'BEGIN.*JWT'
Store the JWT in Vault at secret/fzymgc-house/cluster/mosquitto as bridge_password.
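The `grep -A1` form above prints the header line together with the token. If only the token should end up in Vault, a filter along these lines may help. It is a sketch that assumes the usual creds layout, where the JWT sits on the line directly after the `BEGIN NATS USER JWT` marker; the function name is illustrative.

```shell
# Read an nsc creds file on stdin and print only the user JWT line.
# Assumes the JWT is the single line following the BEGIN marker.
extract_user_jwt() {
  awk '/BEGIN NATS USER JWT/ { getline; print; exit }'
}

# Possible usage (not run here):
#   jwt="$(nsc generate creds --account IOT --name mqtt-bridge | extract_user_jwt)"
#   vault kv patch secret/fzymgc-house/cluster/mosquitto bridge_password="$jwt"
```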
TLS Configuration¶
NATS MQTT uses the nats-client-tls certificate (issued by fzymgc Intermediate CA1).
The listener is configured with mqtt.tls.handshake_first: true, which requires clients
to initiate the TLS handshake immediately (no STARTTLS).
Important: The bridge address must NOT end with a trailing dot. Use nats.nats.svc, not a
rooted name such as "nats.nats.svc.cluster.local." (note the final dot). Go's TLS implementation rejects trailing dots in SNI.
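To guard against the trailing-dot problem when templating the bridge configuration, a check like the following could be applied to the hostname first. This is a sketch; both function names are illustrative and not part of the existing tooling.

```shell
# Drop a single trailing dot from a hostname, if present.
strip_trailing_dot() {
  printf '%s\n' "${1%.}"
}

# Succeed only when the hostname has no trailing dot.
has_no_trailing_dot() {
  case "$1" in
    *.) return 1 ;;
    *)  return 0 ;;
  esac
}

# Example:
#   has_no_trailing_dot "$BRIDGE_HOST" || BRIDGE_HOST="$(strip_trailing_dot "$BRIDGE_HOST")"
```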
Topic Mapping¶
The bridge maps topics bidirectionally with a mqtt/ prefix:
- Mosquitto → NATS: mqtt/sensors/temp → sensors/temp
- NATS → Mosquitto: sensors/temp → mqtt/sensors/temp
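The prefix mapping can be expressed as two small string transforms. The sketch below only mirrors the two examples above; it does not reproduce the bridge's actual configuration, and the function names are illustrative.

```shell
# Mosquitto -> NATS: strip the leading "mqtt/" prefix.
# Fails (non-zero status) when the topic does not carry the prefix.
mosquitto_to_nats() {
  case "$1" in
    mqtt/*) printf '%s\n' "${1#mqtt/}" ;;
    *)      return 1 ;;
  esac
}

# NATS -> Mosquitto: add the "mqtt/" prefix.
nats_to_mosquitto() {
  printf 'mqtt/%s\n' "$1"
}
```

Applying one function and then the other returns the original topic, which is a quick way to sanity-check a mapping by hand.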
Troubleshooting¶
| Error | Cause | Solution |
|---|---|---|
| tlsv1 alert decode error | Trailing dot in hostname (SNI issue) | Use nats.nats.svc without trailing dot |
| Connection Refused: not authorised | Invalid/expired bearer JWT | Regenerate JWT with nsc |
| JetStream not enabled for account | IOT account missing JetStream | Re-push IOT account JWT with JetStream enabled |
| unacceptable protocol version | Race condition on first connect | Transient; bridge auto-retries |
NKey Authentication¶
NATS uses NKeys (Ed25519 key pairs) for authentication. Keys are organized in a hierarchy:
Operator: fzymgc-house
├── Account: SYS (system monitoring)
├── Account: SERVICES (cluster services)
└── Account: IOT (IoT devices)
Key Types¶
| Type | Prefix | Purpose | Storage |
|---|---|---|---|
| Operator | O | Root signing authority | Vault (operator_jwt) |
| Account | A | Namespace/tenant isolation | Vault (*_account_seed) |
| User | U | Client authentication | Generated per-service |
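The prefixes in the table make it possible to tell key types apart at a glance. As a sketch, the helper below classifies a key by its first character; it relies on NKey seeds starting with `S`, and treats anything outside the table as unknown. The function name is illustrative.

```shell
# Classify an NKey string by its leading character.
# Seeds (private keys) start with "S"; public keys start with the
# type prefix listed in the table (O, A, U).
nkey_kind() {
  case "$1" in
    S*) echo "seed" ;;
    O*) echo "operator" ;;
    A*) echo "account" ;;
    U*) echo "user" ;;
    *)  echo "unknown"; return 1 ;;
  esac
}
```

A check like this can catch the common mistake of pasting a seed where a public key is expected before the value is written anywhere.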
Vault Secret Structure¶
| Key | Description |
|---|---|
| operator_jwt | Operator JWT (signed claims) |
| operator_public | Operator public key (ODA...) |
| sys_account_seed | SYS account private key |
| sys_account_public | SYS account public key |
| sys_account_jwt | Signed SYS account JWT |
| services_account_seed | SERVICES account private key |
| services_account_public | SERVICES account public key |
| services_account_jwt | Signed SERVICES account JWT |
| iot_account_seed | IOT account private key |
| iot_account_public | IOT account public key |
| iot_account_jwt | Signed IOT account JWT |
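Each account follows the same three-key naming pattern, so a completeness check is easy to sketch. The helper below reads key names on stdin and prints any of the three expected keys that are missing for a given prefix. The function name is illustrative, and the `jq` path in the usage comment assumes a KV v2 mount.

```shell
# Print the expected Vault keys that are absent for one account prefix.
# Key names are read from stdin, one per line.
missing_account_keys() {
  prefix="$1"
  present="$(cat)"
  for suffix in seed public jwt; do
    key="${prefix}_account_${suffix}"
    printf '%s\n' "$present" | grep -qxF "$key" || printf '%s\n' "$key"
  done
}

# Possible usage (not run here; assumes KV v2 JSON layout):
#   vault kv get -format=json secret/fzymgc-house/cluster/nats \
#     | jq -r '.data.data | keys[]' | missing_account_keys iot
```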
Key Management¶
Prerequisites¶
Install the nsc CLI:
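The install command is not reproduced in this section; follow the nsc project's own instructions for your platform. Once installed, a preflight check along these lines can confirm that the tools used throughout this guide are on `PATH`. This is a sketch and the function name is illustrative.

```shell
# Report any required CLI that is missing from PATH.
# Returns 0 only when every listed command is found.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   require_cmds nsc vault kubectl || exit 1
```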
Viewing Current Keys¶
# Authenticate to Vault
export VAULT_ADDR=https://vault.fzymgc.house
vault login -method=oidc
# View operator public key
vault kv get -field=operator_public secret/fzymgc-house/cluster/nats
# View all account public keys
for acct in sys services iot; do
echo "$acct: $(vault kv get -field=${acct}_account_public secret/fzymgc-house/cluster/nats)"
done
Adding a New Account¶
NATS uses a NATS-based resolver (type: full) where accounts are pushed dynamically via the PostSync job after NATS starts. Only the SYS account is preloaded for bootstrap.
- Set up temporary nsc environment:
export NKEYS_PATH=$(mktemp -d)
export NSC_HOME=$(mktemp -d)
# Import operator JWT from Vault
vault kv get -field=operator_jwt secret/fzymgc-house/cluster/nats > "$NSC_HOME/operator.jwt"
nsc add operator -u "$NSC_HOME/operator.jwt"
- Create new account:
# Add account
nsc add account NEW_ACCOUNT
# Export credentials
NEW_ACCT_SEED=$(nsc keys --account NEW_ACCOUNT --private)
NEW_ACCT_PUBLIC=$(nsc keys --account NEW_ACCOUNT)
NEW_ACCT_JWT=$(nsc describe account NEW_ACCOUNT --raw)
- Update Vault:
# Add to existing secret
vault kv patch secret/fzymgc-house/cluster/nats \
new_account_seed="$NEW_ACCT_SEED" \
new_account_public="$NEW_ACCT_PUBLIC" \
new_account_jwt="$NEW_ACCT_JWT"
- Update ExternalSecret:
Add the new keys to argocd/app-configs/nats/external-secret.yaml:
data:
- secretKey: new_account_jwt
remoteRef:
key: fzymgc-house/cluster/nats
property: new_account_jwt
- Update account-push-job.yaml:
Add the new account to the PostSync job in argocd/app-configs/nats/account-push-job.yaml:
# Add environment variable
env:
- name: NEW_ACCOUNT_JWT
valueFrom:
secretKeyRef:
name: nats-credentials
key: new_account_jwt
And add push commands in the script (following the existing pattern):
# Import and push NEW_ACCOUNT
echo "Importing NEW_ACCOUNT..."
echo "$NEW_ACCOUNT_JWT" > /tmp/new_account.jwt
nsc import account --file /tmp/new_account.jwt
echo "Pushing NEW_ACCOUNT..."
nsc push -a NEW_ACCOUNT -u "$NATS_URL" --ca-cert "$CA_FILE" --system-user push-admin
Note: The --system-user push-admin flag uses the SYS account user created earlier in the job for authentication. The --ca-cert flag enables TLS verification.
- Clean up:
- Commit and deploy:
ArgoCD will automatically sync the changes. The PostSync job will push the new account after NATS starts.
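The commands for the clean-up step are not shown above. As a sketch, the temporary nsc environment from the first step can be created and removed with a pair of helpers, so that account seeds do not linger on disk or in shell variables. The function names are illustrative; the variable names match those used earlier in this procedure.

```shell
# Create throwaway directories for nsc state and keys.
setup_nsc_tmp() {
  NKEYS_PATH="$(mktemp -d)"
  NSC_HOME="$(mktemp -d)"
  export NKEYS_PATH NSC_HOME
}

# Remove the directories (they hold account seeds) and clear the
# variables that carried key material during the procedure.
cleanup_nsc_tmp() {
  if [ -n "${NKEYS_PATH:-}" ]; then rm -rf -- "$NKEYS_PATH"; fi
  if [ -n "${NSC_HOME:-}" ]; then rm -rf -- "$NSC_HOME"; fi
  unset NKEYS_PATH NSC_HOME NEW_ACCT_SEED NEW_ACCT_PUBLIC NEW_ACCT_JWT
}

# Example:
#   setup_nsc_tmp
#   trap cleanup_nsc_tmp EXIT
```

Registering the cleanup with `trap` means the directories are removed even if a later step fails partway through.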
Creating User Credentials¶
Users authenticate using credentials files (.creds) generated from account keys:
# Generate user for SERVICES account
nsc add user -a SERVICES myservice
# Export credentials file
nsc generate creds -a SERVICES -n myservice > myservice.creds
The credentials file contains:
- User JWT (signed by account)
- User NKey seed (private key)
Key Rotation¶
Why only preload SYS account? The system account must be preloaded because it's required for NATS internal monitoring and health checks. Other accounts can be pushed dynamically after NATS starts, allowing for easier account management and updates without server restarts.
Account JWT Rotation (e.g., SERVICES, IOT):
- Generate new account JWT signed by operator:
- Update Vault with new JWT:
- Wait for ExternalSecret to sync (up to 15 minutes) or force refresh:
- Trigger ArgoCD sync to re-run the PostSync job. The PostSync job will push the updated account JWT to the resolver.
- Verify account was updated:
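The commands for these steps are not reproduced in this section. The sketch below strings together commands that appear elsewhere in this guide, with two assumptions: the account has already been re-signed in the local nsc store, and the ExternalSecret refresh is forced by changing a `force-sync` annotation (the External Secrets Operator convention). With `DRY_RUN=1` the commands are only printed and `<jwt>` stands in for the real token.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# rotate_account_jwt ACCOUNT KEY_PREFIX
#   ACCOUNT     nsc account name, e.g. SERVICES
#   KEY_PREFIX  Vault key prefix, e.g. services
rotate_account_jwt() {
  account="$1"
  key_prefix="$2"
  vault_path="secret/fzymgc-house/cluster/nats"

  # Export the account JWT from the local nsc store (placeholder in dry run).
  if [ "${DRY_RUN:-0}" = "1" ]; then
    jwt="<jwt>"
  else
    jwt="$(nsc describe account "$account" --raw)" || return 1
  fi

  # Store the JWT in Vault under <prefix>_account_jwt.
  run vault kv patch "$vault_path" "${key_prefix}_account_jwt=${jwt}"
  # Assumption: an annotation change triggers an immediate ExternalSecret refresh.
  run kubectl annotate externalsecret nats-credentials -n nats \
    "force-sync=$(date +%s)" --overwrite
  # Re-run the PostSync job by syncing the application.
  run argocd app sync nats --force
}

# Example: DRY_RUN=1 rotate_account_jwt SERVICES services
```

After a real run, verification can use the `nats account info` commands listed under Account Push Job Failures.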
Operator Key Rotation:
⚠️ Critical Operation - Requires re-signing all accounts
- Generate new operator key pair and JWT
- Re-sign all account JWTs with new operator
- Update Vault secrets:
  - operator_jwt
  - operator_public
  - All *_account_jwt fields (re-signed with new operator)
- Force ArgoCD sync - NATS will restart with new config
- PostSync job will push accounts with new signatures
- Update all client credentials
Note: If also rotating the SYS account (generating new keys, not just re-signing), update values.yaml with the new system_account and resolver_preload key.
Common Operations¶
Connect via nats-box¶
kubectl exec -it deploy/nats-box -n nats -- sh
# Inside nats-box
nats server check
nats server info
nats account info
Check Cluster Health¶
Expected output:
╭────────────────────────────────────────────────────────────────────────╮
│ JetStream Summary │
├──────────┬─────────┬────────┬─────────┬────────┬────────┬──────────────┤
│ Server │ Cluster │ Domain │ API │ Errors │ Memory │ File │
├──────────┼─────────┼────────┼─────────┼────────┼────────┼──────────────┤
│ nats-0 │ nats │ fzymgc │ 0 │ 0 │ 0 B │ 0 B / 10 GiB │
│ nats-1 │ nats │ fzymgc │ 0 │ 0 │ 0 B │ 0 B / 10 GiB │
│ nats-2* │ nats │ fzymgc │ 0 │ 0 │ 0 B │ 0 B / 10 GiB │
╰──────────┴─────────┴────────┴─────────┴────────┴────────┴──────────────╯
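The command that produces this report is not shown above; `nats server report jetstream` prints a summary of this shape and is assumed here to be the intended one. As a sketch, the report can be checked mechanically by counting server rows and reading off the row marked with `*`, which appears to denote the current meta leader. Function names are illustrative.

```shell
# Count server rows in a JetStream summary read from stdin.
count_js_servers() {
  grep -c '^│ nats-[0-9]'
}

# Print the server whose name carries the "*" marker.
js_meta_leader() {
  sed -n 's/^│ \(nats-[0-9][0-9]*\)\* .*/\1/p'
}

# Possible usage (not run here):
#   kubectl exec deploy/nats-box -n nats -- nats server report jetstream | count_js_servers
```

A healthy three-node cluster should report three rows and exactly one leader.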
List Streams¶
Create a Stream¶
kubectl exec -it deploy/nats-box -n nats -- nats stream add EVENTS \
--subjects="events.>" \
--storage=file \
--replicas=3 \
--retention=limits \
--max-msgs=-1 \
--max-bytes=1073741824 \
--max-age=168h \
--discard=old
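The numeric limits in this command are easier to review when written as units. The sketch below shows the arithmetic behind `--max-bytes=1073741824` (1 GiB) and `--max-age=168h` (7 days); the helper names are illustrative.

```shell
# Convert a size in GiB to bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Convert a retention period in days to the hour form used by --max-age.
days_to_hours() {
  echo "$(( $1 * 24 ))h"
}

# Example:
#   --max-bytes="$(gib_to_bytes 1)" --max-age="$(days_to_hours 7)"
```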
Publish/Subscribe Test¶
# Terminal 1: Subscribe
kubectl exec -it deploy/nats-box -n nats -- nats sub "test.>"
# Terminal 2: Publish
kubectl exec -it deploy/nats-box -n nats -- nats pub test.hello "Hello NATS"
View Metrics¶
# Port forward to metrics endpoint
kubectl port-forward svc/nats -n nats 7777:7777
# Curl metrics
curl localhost:7777/metrics
JetStream Operations¶
Stream Management¶
# List all streams
nats stream ls
# Get stream info
nats stream info STREAM_NAME
# Purge stream (delete all messages)
nats stream purge STREAM_NAME
# Delete stream
nats stream rm STREAM_NAME
Consumer Management¶
# List consumers for a stream
nats consumer ls STREAM_NAME
# Get consumer info
nats consumer info STREAM_NAME CONSUMER_NAME
# Delete consumer
nats consumer rm STREAM_NAME CONSUMER_NAME
Backup and Restore¶
JetStream streams can be backed up:
# Backup stream
nats stream backup STREAM_NAME /path/to/backup
# Restore stream
nats stream restore STREAM_NAME /path/to/backup
Monitoring¶
Grafana Dashboard¶
The NATS dashboard shows:
- Connection count and throughput
- Message rates (in/out)
- JetStream stream metrics
- Consumer lag and pending messages
- Cluster replication status
Access: Grafana → NATS folder
Key Metrics¶
| Metric | Description | Alert Threshold |
|---|---|---|
| gnatsd_connz_connections | Active connections | - |
| gnatsd_jetstream_storage_bytes | Storage used | >80% capacity |
| gnatsd_slow_consumers | Slow consumer count | >0 for 5m |
| gnatsd_auth_errors | Authentication failures | >1/min |
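To make the storage threshold concrete: with 10 GiB of storage per node, 80% corresponds to 8 GiB used. The sketch below does the same integer arithmetic for arbitrary values; it illustrates the threshold only and is not the alert rule itself. Function names are illustrative.

```shell
# Integer percentage of capacity in use (both arguments in bytes).
storage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Succeed when usage is strictly above 80% of capacity.
storage_over_threshold() {
  [ "$(storage_pct "$1" "$2")" -gt 80 ]
}

# Example with 10 GiB capacity (10737418240 bytes):
#   storage_pct 8589934592 10737418240
```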
Grafana Alerts¶
Defined in argocd/app-configs/nats/grafana-alerts.yaml using Grafana Unified Alerting:
| Alert | Severity | Condition |
|---|---|---|
| NATS JetStream Storage High | critical | Storage >80% for 5m |
| NATS Cluster Quorum Lost | critical | <2 healthy peers for 30s |
| NATS Slow Consumers | warning | Slow consumers for 5m |
| NATS Authentication Errors | warning | Auth errors >1/min for 5m |
| NATS High Connection Count | warning | >1000 connections for 5m |
Alerts are managed via the GrafanaAlertRuleGroup CRD and appear in Grafana's unified alerting UI.
Troubleshooting¶
Pod Not Starting¶
# Check pod status
kubectl get pods -n nats -l app.kubernetes.io/name=nats
# View pod events
kubectl describe pod nats-0 -n nats
# Check logs
kubectl logs nats-0 -n nats
Cluster Not Forming¶
# Check cluster routes
kubectl exec -it nats-0 -n nats -- nats server report connections
# Verify DNS resolution
kubectl exec -it nats-0 -n nats -- nslookup nats-headless.nats.svc.cluster.local
JetStream Not Available¶
# Check JetStream status
kubectl exec -it deploy/nats-box -n nats -- nats server check jetstream
# Verify storage class
kubectl get pvc -n nats
Authentication Failures¶
# Verify ExternalSecret synced
kubectl get externalsecret nats-credentials -n nats
# Check secret contents exist
kubectl get secret nats-credentials -n nats -o jsonpath='{.data}' | jq -r 'keys'
# Verify NATS config has resolver configured
kubectl exec -it nats-0 -n nats -- cat /etc/nats-config/nats.conf | grep -A5 resolver
Account Push Job Failures¶
The PostSync job pushes accounts to the NATS resolver after NATS starts. If it fails:
# Check job status
kubectl get jobs -n nats -l argocd.argoproj.io/hook=PostSync
# View job logs
kubectl logs -n nats job/nats-account-push
# Check for completed/failed pods
kubectl get pods -n nats -l job-name=nats-account-push
Common issues:
| Error | Cause | Solution |
|---|---|---|
| NATS not ready after 4 minutes | NATS pods not starting | Check NATS pod logs and events |
| JetStream not ready after 2 minutes | JetStream not initialized | Verify PVCs are bound, check storage class |
| Failed to import operator JWT | Invalid/expired operator JWT | Regenerate operator JWT in Vault |
| Failed to import account JWT | Account JWT not signed by operator | Re-sign account with current operator |
| system account ... not found | SYS account not imported | Ensure SYS_ACCOUNT_JWT is in the job |
| Authorization Violation | Missing user credentials for push | Ensure SYS account seed is imported and push-admin user is created |
| unknown flag: --creds | Wrong nsc push flag | Use --system-user instead of --creds |
| unknown flag: --file for keys | Wrong nsc import keys flag | Use --dir with a directory containing the key |
| set an operator | Operator context not set | Run nsc env -o <operator> before user creation |
| scheme "tls" is not supported | Wrong URL scheme | Use nats:// scheme; TLS is enabled via --ca-cert |
Re-running the job:
# Delete failed job to allow re-run
kubectl delete job nats-account-push -n nats
# Trigger ArgoCD sync
argocd app sync nats --force
Verify accounts after successful push:
# List all accounts in resolver
kubectl exec -it deploy/nats-box -n nats -- nats account list
# Check specific account
kubectl exec -it deploy/nats-box -n nats -- nats account info SERVICES
kubectl exec -it deploy/nats-box -n nats -- nats account info IOT
PostSync job authentication flow:
The account push job requires a specific sequence to authenticate with the NATS resolver:
- Import operator JWT - Establishes the trust root
- Import SYS account JWT - Identifies the system account
- Import SYS account seed - Enables signing new users (nsc import keys --dir)
- Set operator context - Tells nsc which operator to use (nsc env -o)
- Create push-admin user - Creates a user under SYS for authentication
- Push accounts - Uses --system-user push-admin for authentication
Key insight: The nsc and nats CLIs have different conventions:
- nsc uses nats:// scheme with --ca-cert for TLS
- nats uses tls:// scheme with --tlsca for TLS
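Because the two CLIs expect different URL schemes, a script that shares one server URL between them needs a small conversion step. The sketch below rewrites a `tls://` URL into the `nats://` form for nsc and leaves other URLs untouched; the function name is illustrative.

```shell
# Convert a tls:// server URL (nats CLI style) to nats:// (nsc style).
# URLs with any other scheme are passed through unchanged.
to_nsc_url() {
  case "$1" in
    tls://*) printf 'nats://%s\n' "${1#tls://}" ;;
    *)       printf '%s\n' "$1" ;;
  esac
}

# Example:
#   nsc push -a SERVICES -u "$(to_nsc_url "$NATS_URL")" --ca-cert "$CA_FILE" --system-user push-admin
```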
TLS Issues¶
# Check certificate status
kubectl get certificate -n nats
# Verify certificate secret
kubectl get secret nats-client-tls -n nats
# Test TLS connection
kubectl exec -it deploy/nats-box -n nats -- \
nats server info --tlsca /etc/nats-certs/ca.crt
Disaster Recovery¶
Full Cluster Recovery¶
- Verify PVCs are intact:
- Delete and recreate StatefulSet (keeps PVCs):
- Verify cluster reformation:
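The commands for these recovery steps are not reproduced in this section. The sketch below prints one plausible sequence for review rather than executing it. It assumes the StatefulSet is named `nats` (inferred from the pod names nats-0 to nats-2) and that an ArgoCD sync recreates it; the PVC and pod commands are taken from the Troubleshooting section.

```shell
# Print a candidate recovery sequence; nothing is executed.
# The StatefulSet name defaults to "nats", which is an assumption.
recovery_plan() {
  sts="${1:-nats}"
  cat <<EOF
kubectl get pvc -n nats
kubectl delete statefulset ${sts} -n nats
argocd app sync nats
kubectl get pods -n nats -l app.kubernetes.io/name=nats
EOF
}

# Example:
#   recovery_plan          # review the output, then run each line by hand
```

Deleting a StatefulSet leaves its PersistentVolumeClaims in place by default, which is why the JetStream data survives the recreate step.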
Data Loss Recovery¶
If JetStream data is lost:
- Restore from Velero backup (if available)
- Or recreate streams from configuration
- Clients with durable consumers will resume from last ack
Security Considerations¶
- Operator JWT is the root of trust - protect it carefully
- Never commit NKey seeds or JWTs to Git
- Use separate accounts for different trust levels
- Rotate user credentials periodically
- Monitor authentication failures for potential attacks