
Self-hosted NATS (JetStream) runbook

This runbook covers the self-hosted NATS JetStream cluster running on EC2 (3-node, Graviton, NVMe-backed). All services connect via Route53 private DNS.

Architecture

  • Instances: 3x m8gd.medium (Graviton4, 1 vCPU, 4 GB RAM, NVMe local storage)
  • Region: us-east-2 (one node per AZ)
  • Auth: Operator mode with MEMORY resolver (account JWTs preloaded at boot); see the config sketch after this list
  • Stream: pivot_main — R3, interest retention, all service subjects
  • Networking: Private subnets, NAT gateway for outbound, no public IPs
  • Logging: Axiom (primary), CloudWatch (bootstrap-only fallback)
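
A minimal sketch of the per-node server config this architecture implies is below. The operator JWT path, resolver_preload contents, and the staging hostnames are illustrative assumptions; the authoritative config lives on the instances.

# nats-server.conf sketch (illustrative, not the deployed file)
server_name: nats-staging-node-1
listen: 0.0.0.0:4222
http: 0.0.0.0:8222                  # /healthz, /varz, /jsz, /connz

jetstream {
  store_dir: /var/lib/nats          # NVMe mount
}

cluster {
  name: nats-staging
  listen: 0.0.0.0:6222
  routes: [
    nats://nats-staging-node-1.nats-staging.internal:6222
    nats://nats-staging-node-2.nats-staging.internal:6222
    nats://nats-staging-node-3.nats-staging.internal:6222
  ]
}

# Operator mode with the MEMORY resolver: account JWTs are preloaded at boot
operator: /etc/nats/operator.jwt
resolver: MEMORY
resolver_preload: {
  # <account public key>: <account JWT>
}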

DNS names

| Node   | DNS                                 | Ports                                         |
|--------|-------------------------------------|-----------------------------------------------|
| node-1 | <cluster>-node-1.<cluster>.internal | 4222 (client), 6222 (cluster), 8222 (monitor) |
| node-2 | <cluster>-node-2.<cluster>.internal | same                                          |
| node-3 | <cluster>-node-3.<cluster>.internal | same                                          |

Prerequisites

  • Cloudflare WARP connected. See Cloudflare Tunnels.

  • AWS CLI configured with access to the target account.

  • NATS CLI installed:

    brew install nats-io/nats-tools/nats

Connect with admin credentials (from your laptop)

  1. Export the cluster name and region:

    export AWS_REGION="us-east-2"
    export CLUSTER_NAME="nats-staging"   # or nats-prod
    export SERVICE_NAME="facebox"        # any service, just for the host list
    export SSM_PREFIX="/nats-cluster/${CLUSTER_NAME}"
  2. Fetch the connection string and admin creds from SSM:

    NATS_HOST=$(aws ssm get-parameter \
      --region "$AWS_REGION" \
      --name "$SSM_PREFIX/services/$SERVICE_NAME/nats_host" \
      --query 'Parameter.Value' \
      --output text)

    ADMIN_JWT=$(aws ssm get-parameter \
      --region "$AWS_REGION" \
      --name "$SSM_PREFIX/admin/nats_admin_jwt" \
      --with-decryption \
      --query 'Parameter.Value' \
      --output text)

    ADMIN_NKEY=$(aws ssm get-parameter \
      --region "$AWS_REGION" \
      --name "$SSM_PREFIX/admin/nats_admin_nkey_seed" \
      --with-decryption \
      --query 'Parameter.Value' \
      --output text)
  3. Create a local creds file:

    cat > ./nats-admin.creds <<EOF
-----BEGIN NATS USER JWT-----
$ADMIN_JWT
------END NATS USER JWT------

************************* IMPORTANT *************************
NKEY Seed printed below can be used to sign and prove identity.
NKEYs are sensitive and should be treated as secrets.

-----BEGIN USER NKEY SEED-----
$ADMIN_NKEY
------END USER NKEY SEED------
*************************************************************
EOF
    chmod 600 ./nats-admin.creds
  4. Use the NATS CLI:

    nats --server "$NATS_HOST" --creds ./nats-admin.creds stream ls
    nats --server "$NATS_HOST" --creds ./nats-admin.creds stream info pivot_main
    nats --server "$NATS_HOST" --creds ./nats-admin.creds server report jetstream
  5. Clean up when done:

    rm -f ./nats-admin.creds

Connect with service credentials

Replace the admin SSM paths with service-specific ones:

SERVICE_NAME="messenger" SERVICE_JWT=$(aws ssm get-parameter \ --region "$AWS_REGION" \ --name "$SSM_PREFIX/services/$SERVICE_NAME/nats_admin_jwt" \ --with-decryption \ --query 'Parameter.Value' \ --output text) SERVICE_NKEY=$(aws ssm get-parameter \ --region "$AWS_REGION" \ --name "$SSM_PREFIX/services/$SERVICE_NAME/nats_admin_nkey_seed" \ --with-decryption \ --query 'Parameter.Value' \ --output text)

Build a creds file using the same format as above and connect with:

nats --server "$NATS_HOST" --creds ./nats-service.creds sub "messenger.>"

SSM Session Manager (on-instance access)

All NATS instances have AmazonSSMManagedInstanceCore attached. To connect:

# Find the instance ID for a node
aws ec2 describe-instances \
  --region us-east-2 \
  --filters "Name=tag:Name,Values=${CLUSTER_NAME}-node-1" \
  --query 'Reservations[0].Instances[0].InstanceId' \
  --output text

# Start a session
aws ssm start-session --region us-east-2 --target <instance-id>

Useful commands on the instance

# NATS server status
systemctl status nats

# Bootstrap log (full history from boot)
cat /var/log/nats-bootstrap.log

# NATS server logs (live)
journalctl -u nats -f

# Monitoring service logs (live)
journalctl -u nats-monitor -f

# Fluent Bit status
systemctl status fluent-bit

# Disk usage (NVMe mount for JetStream)
df -h /var/lib/nats

# Use NATS CLI with admin creds (context is pre-configured on node-1)
nats context select pivot
nats stream ls
nats stream info pivot_main
nats server report jetstream
nats server report connections

Health checks

# From laptop (requires Cloudflare WARP)
curl http://${CLUSTER_NAME}-node-1.${CLUSTER_NAME}.internal:8222/healthz

# All monitoring endpoints
curl http://${CLUSTER_NAME}-node-1.${CLUSTER_NAME}.internal:8222/varz | jq .
curl http://${CLUSTER_NAME}-node-1.${CLUSTER_NAME}.internal:8222/jsz | jq .
curl http://${CLUSTER_NAME}-node-1.${CLUSTER_NAME}.internal:8222/connz | jq .

View node logs

Axiom (primary)

All NATS logs go to the nats-logs Axiom dataset: bootstrap, server, and monitoring data from all 3 nodes. This is the primary place to look.
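
If you need a starting query, something along these lines pulls recent events for a single node; the exact field that carries the hostname depends on how Fluent Bit tags records, so treat host below as an assumption:

['nats-logs']
| where host == "nats-staging-node-1"
| sort by _time desc
| limit 100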

CloudWatch (bootstrap-only fallback)

CloudWatch receives only bootstrap logs — the minimum needed to diagnose instances that fail before Axiom/Fluent Bit are operational.

  • Log group: /aws/ec2/nats-cluster/<cluster_name>/diagnostic
  • Stream prefix: <cluster_name>-node-<n>-bootstrap-
  • Retention: 1 day (staging), 3 days (production)
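
To tail the bootstrap stream from your laptop without opening the console (log group and stream prefix as listed above; requires AWS CLI v2):

aws logs tail "/aws/ec2/nats-cluster/${CLUSTER_NAME}/diagnostic" \
  --region "$AWS_REGION" \
  --log-stream-name-prefix "${CLUSTER_NAME}-node-1-bootstrap-" \
  --since 1h \
  --follow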

CloudWatch Logs Insights query to check bootstrap health:

fields @timestamp, @message
| filter @message like /Axiom ingest check/
| sort @timestamp desc
| limit 200

Monitoring and alerting

The monitoring script on each node polls /varz, /connz, /jsz, /healthz and collects system-level metrics (CPU, memory, disk) every ~30 seconds. The combined JSON is sent to Axiom via Fluent Bit.
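
A rough sketch of what that loop looks like is below; the real script and its output path may differ (/var/log/nats-monitor.jsonl and the exact record shape are assumptions):

# Sketch only: mirrors the shape of the monitor loop, not its exact implementation
while true; do
  varz=$(curl -sf http://localhost:8222/varz)
  jsz=$(curl -sf http://localhost:8222/jsz)
  connz=$(curl -sf http://localhost:8222/connz)
  health=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8222/healthz)
  load1=$(cut -d' ' -f1 /proc/loadavg)

  # One combined JSON record per interval; Fluent Bit tails the file and ships it to Axiom
  jq -cn \
    --argjson varz "$varz" --argjson jsz "$jsz" --argjson connz "$connz" \
    --arg health_status "$health" --arg load_1 "$load1" \
    '{varz: $varz, jsz: $jsz, connz: $connz,
      health_status: $health_status,
      system: {load_1: ($load_1 | tonumber)}}' \
    >> /var/log/nats-monitor.jsonl

  sleep 30
done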

Key fields for Axiom alerts

NATS metrics

| Field                     | Description               | Alert threshold                                  |
|---------------------------|---------------------------|--------------------------------------------------|
| jsz.storage               | JetStream bytes used      | Alert when approaching jsz.config.max_file_store |
| jsz.config.max_file_store | Configured max storage    | Reference value                                  |
| jsz.api.errors            | JetStream API error count | Alert on sustained non-zero                      |
| connz.num_connections     | Active client connections | Alert on zero (no services connected)            |
| varz.slow_consumers       | Slow consumer count       | Alert on non-zero                                |
| health_status             | Contains HTTP status code | Alert when not containing 200                    |
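
Several of these values can be spot-checked by hand on a node against the same endpoints the monitor polls:

# Live values behind the alert fields above (names per the nats-server monitoring endpoints)
curl -s http://localhost:8222/varz  | jq '{slow_consumers: .slow_consumers}'
curl -s http://localhost:8222/connz | jq '{num_connections: .num_connections}'
curl -s http://localhost:8222/jsz   | jq '{storage_bytes_used: .storage, api_errors: .api.errors}'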

System metrics (under system.*)

| Field                       | Description                       | Alert threshold |
|-----------------------------|-----------------------------------|-----------------|
| system.cpu_percent          | Overall CPU utilization %         | Alert >= 80%    |
| system.iowait_percent       | CPU time waiting on I/O %         | Alert >= 20%    |
| system.steal_percent        | CPU steal time % (noisy neighbor) | Alert >= 10%    |
| system.mem_used_percent     | Memory utilization %              | Alert >= 85%    |
| system.mem_total_bytes      | Total system memory               | Reference       |
| system.mem_used_bytes       | Used memory                       | Reference       |
| system.mem_available_bytes  | Available memory                  | Reference       |
| system.disk_used_percent    | NVMe disk utilization %           | Alert >= 80%    |
| system.disk_total_bytes     | NVMe total capacity               | Reference       |
| system.disk_used_bytes      | NVMe used bytes                   | Reference       |
| system.disk_available_bytes | NVMe available bytes              | Reference       |
| system.load_1               | 1-min load average                | Reference       |
| system.load_5               | 5-min load average                | Reference       |
| system.load_15              | 15-min load average               | Reference       |
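
When validating a system-level alert, the same numbers can be pulled by hand on the node:

# Point-in-time equivalents of the system.* fields above
vmstat 1 2            # CPU columns (us/sy/id/wa/st): cpu_percent, iowait_percent, steal_percent
free -b               # mem_total_bytes, mem_used_bytes, mem_available_bytes
df -B1 /var/lib/nats  # disk_total_bytes, disk_used_bytes, disk_available_bytes
cat /proc/loadavg     # load_1, load_5, load_15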

Rollouts

Instance refreshes are managed by Terraform and run sequentially with a 120-second inter-node delay for RAFT/JetStream sync:

node-1 refresh → wait for success → sleep 120s → node-2 refresh → wait for success → sleep 120s → node-3 refresh → wait for success

Each node has a 120-second InstanceWarmup before ASG considers it healthy. Pushing to main triggers Terraform Cloud plan/apply which handles everything automatically.
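
To watch a refresh that Terraform has kicked off (ASG naming follows the pattern used in the manual replacement section below):

aws autoscaling describe-instance-refreshes \
  --region "$AWS_REGION" \
  --auto-scaling-group-name "${CLUSTER_NAME}-node-1-asg" \
  --query 'InstanceRefreshes[0].{Status:Status,Pct:PercentageComplete,Reason:StatusReason}' \
  --output table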

Manual single-node replacement

If you need to replace a single node outside of Terraform:

export AWS_REGION="us-east-2"
export CLUSTER_NAME="nats-staging"
export ASG_NAME="${CLUSTER_NAME}-node-1-asg"

INSTANCE_ID=$(aws autoscaling describe-auto-scaling-groups \
  --region "$AWS_REGION" \
  --auto-scaling-group-names "$ASG_NAME" \
  --query 'AutoScalingGroups[0].Instances[0].InstanceId' \
  --output text)

# Keep desired capacity so the ASG launches a replacement node
aws autoscaling terminate-instance-in-auto-scaling-group \
  --region "$AWS_REGION" \
  --instance-id "$INSTANCE_ID" \
  --no-should-decrement-desired-capacity

Wait for the replacement to come up and verify before touching the next node:

  1. Check CloudWatch for Axiom ingest check in the bootstrap stream.
  2. Check Axiom nats-logs for fresh events from that node.
  3. Run nats server report jetstream to confirm RAFT replicas are current (see the check below).
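
For step 3, the per-replica state of the main stream can also be inspected directly; every listed replica should report current: true once the new node has caught up:

nats --server "$NATS_HOST" --creds ./nats-admin.creds \
  stream info pivot_main --json | jq '.cluster.replicas'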

Switching a service to self-hosted NATS

Each service’s nats_ssm_parameter_arn_base in services.tf controls which NATS cluster it connects to. To switch a service:

  1. In apps/terraform-aws-backend-staging/src/services.tf, change:

    # Old (Synadia Cloud)
    nats_ssm_parameter_arn_base = "...parameter${module.nats.ssm_prefix}"

    # New (self-hosted)
    nats_ssm_parameter_arn_base = "...parameter${module.nats_cluster.service_ssm_prefixes["<service_name>"]}"
  2. The IAM permission for /nats-cluster/* SSM reads is already added to all service modules — no additional IAM changes needed.

  3. Push to main. Terraform will update the ECS task definition and deploy.

The SSM parameter names are identical between old and new (nats_host, nats_admin_jwt, nats_admin_nkey_seed), so no application code changes are needed.
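
After the switch, a quick way to confirm the parameters resolve under the self-hosted prefix (with SERVICE_NAME set to the service you just switched and the other variables exported earlier in this runbook):

for p in nats_host nats_admin_jwt nats_admin_nkey_seed; do
  aws ssm get-parameter \
    --region "$AWS_REGION" \
    --name "/nats-cluster/${CLUSTER_NAME}/services/${SERVICE_NAME}/$p" \
    --query 'Parameter.Name' \
    --output text
done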
