NATS Operations

Operational guide for NATS messaging infrastructure in the fzymgc-house cluster.

Quick Reference

| Property | Value |
|----------|-------|
| Namespace | nats |
| Service | nats.nats.svc.cluster.local:4222 |
| Helm Chart | nats/nats v2.12.3 |
| Storage | longhorn-encrypted (10Gi/node) |
| Vault Path | secret/fzymgc-house/cluster/nats |
| Dashboard | Grafana → NATS folder |

Client Access

Applications must label their namespace to connect to NATS:

kubectl label namespace <app-namespace> nats-client=true

This is enforced by NetworkPolicy. Without the label, connections will be blocked.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    NATS 3-Node Cluster                      │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐                 │
│  │ nats-0  │◄──►│ nats-1  │◄──►│ nats-2  │  (Raft)         │
│  │JetStream│    │JetStream│    │JetStream│                  │
│  └────┬────┘    └────┬────┘    └────┬────┘                  │
│       │              │              │                       │
│  PVC (10Gi)    PVC (10Gi)    PVC (10Gi)                    │
└─────────────────────────────────────────────────────────────┘
┌─────────────────┐   ┌─────────────────┐
│ SERVICES acct   │   │ IOT acct        │
│ (cluster apps)  │   │ (Home Assistant)│
└─────────────────┘   └─────────────────┘

MQTT Bridge

Mosquitto runs in the mosquitto namespace and bridges to NATS via the MQTT listener on port 1883.

| Property | Value |
|----------|-------|
| Bridge address | nats.nats.svc:1883 |
| TLS | Required (handshake_first: true on NATS) |
| CA bundle | fzymgc-ica1-ca ConfigMap (fullchain.crt) |
| Auth | Bearer JWT from IOT account |
| Vault path | secret/fzymgc-house/cluster/mosquitto |
| Topic mapping | mqtt/<topic> (Mosquitto) ↔ <topic> (NATS) |

Authentication

The NATS MQTT listener runs in operator mode. MQTT clients must use a bearer user JWT as the password, with any non-empty username. Generate the bearer JWT with:

nsc add user --account IOT --name mqtt-bridge --bearer --conn-type MQTT
nsc generate creds --account IOT --name mqtt-bridge | grep -A1 'BEGIN.*JWT'

Store the JWT in Vault at secret/fzymgc-house/cluster/mosquitto as bridge_password.
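
As a minimal sketch, the JWT line can be pulled out of the creds output with a small filter instead of copying it by hand. The helper name `extract_user_jwt` is an assumption made for illustration; it relies only on the same `BEGIN ... JWT` marker that the `grep` above matches, and the usage lines assume the Vault secret already exists.

```shell
# Hypothetical helper: print only the JWT line from creds text read on stdin.
# It prints the single line that follows the "BEGIN ... JWT" marker.
extract_user_jwt() {
  sed -n '/BEGIN.*JWT/{n;p;}'
}

# Usage (requires nsc and vault, so it is left commented out):
#   jwt=$(nsc generate creds --account IOT --name mqtt-bridge | extract_user_jwt)
#   vault kv patch secret/fzymgc-house/cluster/mosquitto bridge_password="$jwt"
```

Unlike `grep -A1`, the filter drops the marker line itself, so the result can be stored directly as bridge_password.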

TLS Configuration

NATS MQTT uses the nats-client-tls certificate (issued by fzymgc Intermediate CA1). The listener is configured with mqtt.tls.handshake_first: true, which requires clients to initiate the TLS handshake immediately (no STARTTLS).

Important: The bridge address must NOT use a trailing dot (e.g., nats.nats.svc not nats.nats.svc.cluster.local.). Go's TLS implementation rejects trailing dots in SNI.
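
A small guard can normalize a hostname before it is written into the bridge configuration. This is only a sketch: `strip_trailing_dot` is a hypothetical helper, not part of Mosquitto or NATS.

```shell
# Hypothetical helper: remove one trailing dot from a hostname, if present.
strip_trailing_dot() {
  printf '%s\n' "${1%.}"
}

# Usage:
#   strip_trailing_dot "nats.nats.svc.cluster.local."   # prints the name without the final dot
```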

Topic Mapping

The bridge maps topics bidirectionally with a mqtt/ prefix:

  • Mosquitto → NATS: mqtt/sensors/temp → sensors/temp
  • NATS → Mosquitto: sensors/temp → mqtt/sensors/temp
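
The prefix rule can be sketched as two string functions. The function names are assumptions for illustration; the actual mapping is done by the Mosquitto bridge configuration, not by a script.

```shell
# Mosquitto -> NATS: strip the leading "mqtt/" prefix.
# Returns non-zero for topics outside the bridged prefix.
mosquitto_to_nats() {
  case "$1" in
    mqtt/*) printf '%s\n' "${1#mqtt/}" ;;
    *)      return 1 ;;
  esac
}

# NATS -> Mosquitto: add the "mqtt/" prefix.
nats_to_mosquitto() {
  printf 'mqtt/%s\n' "$1"
}
```

Applying one function and then the other returns the original topic, which is what makes the mapping bidirectional.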

Troubleshooting

| Error | Cause | Solution |
|-------|-------|----------|
| tlsv1 alert decode error | Trailing dot in hostname (SNI issue) | Use nats.nats.svc without trailing dot |
| Connection Refused: not authorised | Invalid/expired bearer JWT | Regenerate JWT with nsc |
| JetStream not enabled for account | IOT account missing JetStream | Re-push IOT account JWT with JetStream enabled |
| unacceptable protocol version | Race condition on first connect | Transient; bridge auto-retries |

NKey Authentication

NATS uses NKeys (Ed25519 key pairs) for authentication. Keys are organized in a hierarchy:

Operator: fzymgc-house
├── Account: SYS (system monitoring)
├── Account: SERVICES (cluster services)
└── Account: IOT (IoT devices)

Key Types

| Type | Prefix | Purpose | Storage |
|------|--------|---------|---------|
| Operator | O | Root signing authority | Vault (operator_jwt) |
| Account | A | Namespace/tenant isolation | Vault (*_account_seed) |
| User | U | Client authentication | Generated per-service |
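
The prefix column can be checked mechanically, which helps catch a seed pasted into a field that expects a public key. The sketch below assumes the usual NKey encoding, where public keys begin with the letter from the table and seeds begin with `S` followed by that letter; `nkey_type` is a hypothetical helper name.

```shell
# Hypothetical helper: classify an NKey string by its leading characters.
nkey_type() {
  case "$1" in
    SO*) echo "operator seed" ;;
    SA*) echo "account seed" ;;
    SU*) echo "user seed" ;;
    O*)  echo "operator" ;;
    A*)  echo "account" ;;
    U*)  echo "user" ;;
    *)   echo "unknown"; return 1 ;;
  esac
}

# Usage (requires vault, so it is left commented out):
#   nkey_type "$(vault kv get -field=operator_public secret/fzymgc-house/cluster/nats)"
```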

Vault Secret Structure

vault kv get secret/fzymgc-house/cluster/nats

| Key | Description |
|-----|-------------|
| operator_jwt | Operator JWT (signed claims) |
| operator_public | Operator public key (ODA...) |
| sys_account_seed | SYS account private key |
| sys_account_public | SYS account public key |
| sys_account_jwt | Signed SYS account JWT |
| services_account_seed | SERVICES account private key |
| services_account_public | SERVICES account public key |
| services_account_jwt | Signed SERVICES account JWT |
| iot_account_seed | IOT account private key |
| iot_account_public | IOT account public key |
| iot_account_jwt | Signed IOT account JWT |

Key Management

Prerequisites

Install the nsc CLI:

brew install nats-io/nats-tools/nsc
# or
go install github.com/nats-io/nsc/v2@latest

Viewing Current Keys

# Authenticate to Vault
export VAULT_ADDR=https://vault.fzymgc.house
vault login -method=oidc

# View operator public key
vault kv get -field=operator_public secret/fzymgc-house/cluster/nats

# View all account public keys
for acct in sys services iot; do
  echo "$acct: $(vault kv get -field=${acct}_account_public secret/fzymgc-house/cluster/nats)"
done

Adding a New Account

NATS uses a NATS-based resolver (type: full) where accounts are pushed dynamically via the PostSync job after NATS starts. Only the SYS account is preloaded for bootstrap.

  1. Set up temporary nsc environment:
export NKEYS_PATH=$(mktemp -d)
export NSC_HOME=$(mktemp -d)

# Import operator JWT from Vault
vault kv get -field=operator_jwt secret/fzymgc-house/cluster/nats > "$NSC_HOME/operator.jwt"
nsc add operator -u "$NSC_HOME/operator.jwt"
  2. Create new account:
# Add account
nsc add account NEW_ACCOUNT

# Export credentials
NEW_ACCT_SEED=$(nsc keys --account NEW_ACCOUNT --private)
NEW_ACCT_PUBLIC=$(nsc keys --account NEW_ACCOUNT)
NEW_ACCT_JWT=$(nsc describe account NEW_ACCOUNT --raw)
  3. Update Vault:
# Add to existing secret
vault kv patch secret/fzymgc-house/cluster/nats \
  new_account_seed="$NEW_ACCT_SEED" \
  new_account_public="$NEW_ACCT_PUBLIC" \
  new_account_jwt="$NEW_ACCT_JWT"
  4. Update ExternalSecret:

Add the new keys to argocd/app-configs/nats/external-secret.yaml:

data:
  - secretKey: new_account_jwt
    remoteRef:
      key: fzymgc-house/cluster/nats
      property: new_account_jwt
  5. Update account-push-job.yaml:

Add the new account to the PostSync job in argocd/app-configs/nats/account-push-job.yaml:

# Add environment variable
env:
  - name: NEW_ACCOUNT_JWT
    valueFrom:
      secretKeyRef:
        name: nats-credentials
        key: new_account_jwt

And add push commands in the script (following the existing pattern):

# Import and push NEW_ACCOUNT
echo "Importing NEW_ACCOUNT..."
echo "$NEW_ACCOUNT_JWT" > /tmp/new_account.jwt
nsc import account --file /tmp/new_account.jwt

echo "Pushing NEW_ACCOUNT..."
nsc push -a NEW_ACCOUNT -u "$NATS_URL" --ca-cert "$CA_FILE" --system-user push-admin

Note: The --system-user push-admin flag uses the SYS account user created earlier in the job for authentication. The --ca-cert flag enables TLS verification.

  6. Clean up:
rm -rf "$NKEYS_PATH" "$NSC_HOME"
  7. Commit and deploy:

ArgoCD will automatically sync the changes. The PostSync job will push the new account after NATS starts.

Creating User Credentials

Users authenticate using credentials files (.creds) generated from account keys:

# Generate user for SERVICES account
nsc add user -a SERVICES myservice

# Export credentials file
nsc generate creds -a SERVICES -n myservice > myservice.creds

The credentials file contains:

  • User JWT (signed by account)
  • User NKey seed (private key)

Key Rotation

Why preload only the SYS account? The system account must be preloaded because NATS requires it for internal monitoring and health checks. Other accounts can be pushed dynamically after NATS starts, which allows account changes without server restarts.

Account JWT Rotation (e.g., SERVICES, IOT):

  1. Generate new account JWT signed by operator:

    nsc edit account SERVICES  # Make changes
    nsc describe account SERVICES --raw > services.jwt
    

  2. Update Vault with new JWT:

    vault kv patch secret/fzymgc-house/cluster/nats \
      services_account_jwt="$(cat services.jwt)"
    

  3. Wait for ExternalSecret to sync (up to 15 minutes) or force refresh:

    kubectl annotate externalsecret nats-credentials -n nats \
      force-sync=$(date +%s) --overwrite
    

  4. Trigger ArgoCD sync to re-run the PostSync job:

    argocd app sync nats --force
    
    The PostSync job will push the updated account JWT to the resolver.

  5. Verify account was updated:

    kubectl exec -it deploy/nats-box -n nats -- nats account info SERVICES
    

Operator Key Rotation:

⚠️ Critical Operation - Requires re-signing all accounts

  1. Generate new operator key pair and JWT
  2. Re-sign all account JWTs with new operator
  3. Update Vault secrets:
       • operator_jwt
       • operator_public
       • All *_account_jwt fields (re-signed with new operator)
  4. Force ArgoCD sync - NATS will restart with new config
  5. PostSync job will push accounts with new signatures
  6. Update all client credentials

Note: If ALSO rotating the SYS account (generating new keys, not just re-signing), update values.yaml with new system_account and resolver_preload key.

Common Operations

Connect via nats-box

kubectl exec -it deploy/nats-box -n nats -- sh

# Inside nats-box
nats server check
nats server info
nats account info

Check Cluster Health

kubectl exec -it deploy/nats-box -n nats -- nats server report jetstream

Expected output:

╭────────────────────────────────────────────────────────────────────────╮
│                          JetStream Summary                             │
├──────────┬─────────┬────────┬─────────┬────────┬────────┬──────────────┤
│ Server   │ Cluster │ Domain │ API     │ Errors │ Memory │ File         │
├──────────┼─────────┼────────┼─────────┼────────┼────────┼──────────────┤
│ nats-0   │ nats    │ fzymgc │ 0       │ 0      │ 0 B    │ 0 B / 10 GiB │
│ nats-1   │ nats    │ fzymgc │ 0       │ 0      │ 0 B    │ 0 B / 10 GiB │
│ nats-2*  │ nats    │ fzymgc │ 0       │ 0      │ 0 B    │ 0 B / 10 GiB │
╰──────────┴─────────┴────────┴─────────┴────────┴────────┴──────────────╯

List Streams

kubectl exec -it deploy/nats-box -n nats -- nats stream ls

Create a Stream

kubectl exec -it deploy/nats-box -n nats -- nats stream add EVENTS \
  --subjects="events.>" \
  --storage=file \
  --replicas=3 \
  --retention=limits \
  --max-msgs=-1 \
  --max-bytes=1073741824 \
  --max-age=168h \
  --discard=old
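
The numeric limits above are easier to review with their units spelled out: --max-bytes=1073741824 is 1 GiB and --max-age=168h is 7 days. The helpers below are a hypothetical sketch for deriving such values when sizing a different stream.

```shell
# Convert whole GiB to bytes for --max-bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Convert whole days to an hour-based duration string for --max-age.
days_to_hours() {
  echo "$(( $1 * 24 ))h"
}

# Usage:
#   gib_to_bytes 1     # value used for --max-bytes above
#   days_to_hours 7    # value used for --max-age above
```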

Publish/Subscribe Test

# Terminal 1: Subscribe
kubectl exec -it deploy/nats-box -n nats -- nats sub "test.>"

# Terminal 2: Publish
kubectl exec -it deploy/nats-box -n nats -- nats pub test.hello "Hello NATS"

View Metrics

# Port forward to metrics endpoint
kubectl port-forward svc/nats -n nats 7777:7777

# Curl metrics
curl localhost:7777/metrics

JetStream Operations

Stream Management

# List all streams
nats stream ls

# Get stream info
nats stream info STREAM_NAME

# Purge stream (delete all messages)
nats stream purge STREAM_NAME

# Delete stream
nats stream rm STREAM_NAME

Consumer Management

# List consumers for a stream
nats consumer ls STREAM_NAME

# Get consumer info
nats consumer info STREAM_NAME CONSUMER_NAME

# Delete consumer
nats consumer rm STREAM_NAME CONSUMER_NAME

Backup and Restore

JetStream streams can be backed up:

# Backup stream
nats stream backup STREAM_NAME /path/to/backup

# Restore stream
nats stream restore STREAM_NAME /path/to/backup

Monitoring

Grafana Dashboard

The NATS dashboard shows:

  • Connection count and throughput
  • Message rates (in/out)
  • JetStream stream metrics
  • Consumer lag and pending messages
  • Cluster replication status

Access: Grafana → NATS folder

Key Metrics

| Metric | Description | Alert Threshold |
|--------|-------------|-----------------|
| gnatsd_connz_connections | Active connections | - |
| gnatsd_jetstream_storage_bytes | Storage used | >80% capacity |
| gnatsd_slow_consumers | Slow consumer count | >0 for 5m |
| gnatsd_auth_errors | Authentication failures | >1/min |
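
The storage threshold is a simple ratio of used bytes to the per-node capacity (10Gi). The following is a sketch of that arithmetic only; the function names are assumptions, and the real alert is evaluated by Grafana rather than by a script.

```shell
# Integer percentage of capacity in use. Arguments: used_bytes total_bytes.
storage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Succeeds when usage is strictly above the threshold (default 80).
# Arguments: used_bytes total_bytes [threshold_pct]
storage_alert() {
  [ "$(storage_pct "$1" "$2")" -gt "${3:-80}" ]
}

# Usage: 9 GiB used on a 10 GiB volume
#   storage_alert 9663676416 10737418240 && echo "storage high"
```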

Grafana Alerts

Defined in argocd/app-configs/nats/grafana-alerts.yaml using Grafana Unified Alerting:

| Alert | Severity | Condition |
|-------|----------|-----------|
| NATS JetStream Storage High | critical | Storage >80% for 5m |
| NATS Cluster Quorum Lost | critical | <2 healthy peers for 30s |
| NATS Slow Consumers | warning | Slow consumers for 5m |
| NATS Authentication Errors | warning | Auth errors >1/min for 5m |
| NATS High Connection Count | warning | >1000 connections for 5m |

Alerts are managed via the GrafanaAlertRuleGroup CRD and appear in Grafana's unified alerting UI.
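
The quorum threshold follows from Raft majority arithmetic: a cluster of n nodes needs floor(n/2) + 1 healthy members, which gives 2 for this 3-node cluster. A sketch of the calculation, with hypothetical function names:

```shell
# Minimum number of healthy nodes required for a Raft majority.
raft_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of node failures the cluster can absorb while keeping quorum.
raft_tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

For three nodes this means a single node can be lost or restarted without triggering the quorum alert.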

Troubleshooting

Pod Not Starting

# Check pod status
kubectl get pods -n nats -l app.kubernetes.io/name=nats

# View pod events
kubectl describe pod nats-0 -n nats

# Check logs
kubectl logs nats-0 -n nats

Cluster Not Forming

# Check cluster routes
kubectl exec -it deploy/nats-box -n nats -- nats server report connections

# Verify DNS resolution
kubectl exec -it nats-0 -n nats -- nslookup nats-headless.nats.svc.cluster.local

JetStream Not Available

# Check JetStream status
kubectl exec -it deploy/nats-box -n nats -- nats server check jetstream

# Verify storage class
kubectl get pvc -n nats

Authentication Failures

# Verify ExternalSecret synced
kubectl get externalsecret nats-credentials -n nats

# Check secret contents exist
kubectl get secret nats-credentials -n nats -o jsonpath='{.data}' | jq -r 'keys'

# Verify NATS config has resolver configured
kubectl exec -it nats-0 -n nats -- cat /etc/nats-config/nats.conf | grep -A5 resolver

Account Push Job Failures

The PostSync job pushes accounts to the NATS resolver after NATS starts. If it fails:

# Check job status
kubectl get jobs -n nats -l argocd.argoproj.io/hook=PostSync

# View job logs
kubectl logs -n nats job/nats-account-push

# Check for completed/failed pods
kubectl get pods -n nats -l job-name=nats-account-push

Common issues:

| Error | Cause | Solution |
|-------|-------|----------|
| NATS not ready after 4 minutes | NATS pods not starting | Check NATS pod logs and events |
| JetStream not ready after 2 minutes | JetStream not initialized | Verify PVCs are bound, check storage class |
| Failed to import operator JWT | Invalid/expired operator JWT | Regenerate operator JWT in Vault |
| Failed to import account JWT | Account JWT not signed by operator | Re-sign account with current operator |
| system account ... not found | SYS account not imported | Ensure SYS_ACCOUNT_JWT is in the job |
| Authorization Violation | Missing user credentials for push | Ensure SYS account seed is imported and push-admin user is created |
| unknown flag: --creds | Wrong nsc push flag | Use --system-user instead of --creds |
| unknown flag: --file for keys | Wrong nsc import keys flag | Use --dir with a directory containing the key |
| set an operator | Operator context not set | Run nsc env -o <operator> before user creation |
| scheme "tls" is not supported | Wrong URL scheme | Use nats:// scheme; TLS is enabled via --ca-cert |

Re-running the job:

# Delete failed job to allow re-run
kubectl delete job nats-account-push -n nats

# Trigger ArgoCD sync
argocd app sync nats --force

Verify accounts after successful push:

# List all accounts in resolver
kubectl exec -it deploy/nats-box -n nats -- nats account list

# Check specific account
kubectl exec -it deploy/nats-box -n nats -- nats account info SERVICES
kubectl exec -it deploy/nats-box -n nats -- nats account info IOT

PostSync job authentication flow:

The account push job requires a specific sequence to authenticate with the NATS resolver:

  1. Import operator JWT - Establishes the trust root
  2. Import SYS account JWT - Identifies the system account
  3. Import SYS account seed - Enables signing new users (nsc import keys --dir)
  4. Set operator context - Tells nsc which operator to use (nsc env -o)
  5. Create push-admin user - Creates a user under SYS for authentication
  6. Push accounts - Uses --system-user push-admin for authentication
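
The sequence above can be sketched as a dry-run wrapper that prints each command in order, which is useful when debugging the job script. This is not the job's actual script: the function names, the /tmp paths, and the use of `nsc add operator -u` for step 1 (taken from the account-creation steps earlier in this guide) are assumptions, and NATS_URL and CA_FILE must be set by the caller.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# push_account_flow OPERATOR ACCOUNT
# File paths are placeholders, not the paths used by the real job.
push_account_flow() {
  operator=$1
  account=$2
  run nsc add operator -u /tmp/operator.jwt           # 1. import operator JWT (trust root)
  run nsc import account --file /tmp/sys_account.jwt  # 2. import SYS account JWT
  run nsc import keys --dir /tmp/keys                 # 3. import SYS account seed
  run nsc env -o "$operator"                          # 4. set operator context
  run nsc add user -a SYS -n push-admin               # 5. create push-admin user
  run nsc push -a "$account" -u "$NATS_URL" --ca-cert "$CA_FILE" --system-user push-admin  # 6. push
}
```

With DRY_RUN=1, calling `push_account_flow fzymgc-house IOT` prints the six commands without contacting the cluster.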

Key insight: The nsc and nats CLIs have different conventions:

  • nsc uses nats:// scheme with --ca-cert for TLS
  • nats uses tls:// scheme with --tlsca for TLS
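
When one URL variable is shared between both CLIs, the scheme can be rewritten before it is handed to nsc. A minimal sketch, where `to_nsc_url` is a hypothetical helper:

```shell
# Hypothetical helper: rewrite a tls:// URL to the nats:// form that nsc accepts.
# Any other URL is passed through unchanged.
to_nsc_url() {
  case "$1" in
    tls://*) printf 'nats://%s\n' "${1#tls://}" ;;
    *)       printf '%s\n' "$1" ;;
  esac
}

# Usage (requires nsc, so it is left commented out):
#   nsc push -a IOT -u "$(to_nsc_url "$NATS_URL")" --ca-cert "$CA_FILE" --system-user push-admin
```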

TLS Issues

# Check certificate status
kubectl get certificate -n nats

# Verify certificate secret
kubectl get secret nats-client-tls -n nats

# Test TLS connection
kubectl exec -it deploy/nats-box -n nats -- \
  nats server info --tlsca /etc/nats-certs/ca.crt

Disaster Recovery

Full Cluster Recovery

  1. Verify PVCs are intact:

    kubectl get pvc -n nats
    

  2. Delete and recreate StatefulSet (keeps PVCs):

    kubectl delete sts nats -n nats --cascade=orphan
    # ArgoCD will recreate the StatefulSet
    

  3. Verify cluster reformation:

    kubectl exec -it deploy/nats-box -n nats -- nats server report jetstream
    

Data Loss Recovery

If JetStream data is lost:

  1. Restore from Velero backup (if available)
  2. Or recreate streams from configuration
  3. Clients with durable consumers will resume from last ack

Security Considerations

  • Operator JWT is the root of trust - protect it carefully
  • Never commit NKey seeds or JWTs to Git
  • Use separate accounts for different trust levels
  • Rotate user credentials periodically
  • Monitor authentication failures for potential attacks
