# Self-hosted NATS (JetStream) runbook
This runbook covers the self-hosted NATS JetStream cluster running on EC2 (3-node, Graviton, NVMe-backed). All services connect via Route53 private DNS.
## Architecture

- Instances: 3x `m8gd.medium` (Graviton4, 1 vCPU, 4 GB RAM, NVMe local storage)
- Region: `us-east-2` (one node per AZ)
- Auth: Operator mode with MEMORY resolver (account JWTs preloaded at boot)
- Stream: `pivot_main` (R3, interest retention, all service subjects; see the check below this list)
- Networking: Private subnets, NAT gateway for outbound, no public IPs
- Logging: Axiom (primary), CloudWatch (bootstrap-only fallback)
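
A quick way to confirm the stream matches this layout, once the admin creds file from "Connect with admin credentials" below exists. This is a sketch; it assumes `jq` is installed locally.

```bash
# Check replicas, retention, and subjects against the spec above.
nats --server "$NATS_HOST" --creds ./nats-admin.creds stream info pivot_main --json \
  | jq '{replicas: .config.num_replicas, retention: .config.retention, subjects: .config.subjects}'
```
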
## DNS names

| Node | DNS | Ports |
|---|---|---|
| node-1 | `<cluster>-node-1.<cluster>.internal` | 4222 (client), 6222 (cluster), 8222 (monitor) |
| node-2 | `<cluster>-node-2.<cluster>.internal` | same |
| node-3 | `<cluster>-node-3.<cluster>.internal` | same |
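
A quick resolution check from a laptop (a sketch; `nats-staging` is used as an example cluster name, and the private `.internal` zone only resolves over Cloudflare WARP or from inside the VPC):

```bash
# Should return the node's private IP if WARP/VPC DNS is working.
dig +short nats-staging-node-1.nats-staging.internal
```
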
## Prerequisites

- Cloudflare WARP connected. See Cloudflare Tunnels.
- AWS CLI configured with access to the target account.
- NATS CLI installed: `brew install nats-io/nats-tools/nats`
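
A quick sanity check that the tooling is in place (nothing cluster-specific yet):

```bash
aws sts get-caller-identity   # confirms AWS CLI credentials for the target account
nats --version                # confirms the NATS CLI is on PATH
```
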
## Connect with admin credentials (from your laptop)

1. Export the cluster name and region:

```bash
export AWS_REGION="us-east-2"
export CLUSTER_NAME="nats-staging"   # or nats-prod
export SERVICE_NAME="facebox"        # any service, just for the host list
export SSM_PREFIX="/nats-cluster/${CLUSTER_NAME}"
```

2. Fetch the connection string and admin creds from SSM:

```bash
NATS_HOST=$(aws ssm get-parameter \
  --region "$AWS_REGION" \
  --name "$SSM_PREFIX/services/$SERVICE_NAME/nats_host" \
  --query 'Parameter.Value' \
  --output text)

ADMIN_JWT=$(aws ssm get-parameter \
  --region "$AWS_REGION" \
  --name "$SSM_PREFIX/admin/nats_admin_jwt" \
  --with-decryption \
  --query 'Parameter.Value' \
  --output text)

ADMIN_NKEY=$(aws ssm get-parameter \
  --region "$AWS_REGION" \
  --name "$SSM_PREFIX/admin/nats_admin_nkey_seed" \
  --with-decryption \
  --query 'Parameter.Value' \
  --output text)
```

3. Create a local creds file:

```bash
cat > ./nats-admin.creds <<EOF
-----BEGIN NATS USER JWT-----
$ADMIN_JWT
------END NATS USER JWT------

************************* IMPORTANT *************************
NKEY Seed printed below can be used to sign and prove identity.
NKEYs are sensitive and should be treated as secrets.

-----BEGIN USER NKEY SEED-----
$ADMIN_NKEY
------END USER NKEY SEED------

*************************************************************
EOF
chmod 600 ./nats-admin.creds
```

4. Use the NATS CLI:

```bash
nats --server "$NATS_HOST" --creds ./nats-admin.creds stream ls
nats --server "$NATS_HOST" --creds ./nats-admin.creds stream info pivot_main
nats --server "$NATS_HOST" --creds ./nats-admin.creds server report jetstream
```

5. Clean up when done:

```bash
rm -f ./nats-admin.creds
```
## Connect with service credentials

Replace the admin SSM paths with service-specific ones:

```bash
SERVICE_NAME="messenger"

SERVICE_JWT=$(aws ssm get-parameter \
--region "$AWS_REGION" \
--name "$SSM_PREFIX/services/$SERVICE_NAME/nats_admin_jwt" \
--with-decryption \
--query 'Parameter.Value' \
--output text)
SERVICE_NKEY=$(aws ssm get-parameter \
--region "$AWS_REGION" \
--name "$SSM_PREFIX/services/$SERVICE_NAME/nats_admin_nkey_seed" \
--with-decryption \
--query 'Parameter.Value' \
  --output text)
```

Build a creds file using the same format as above and connect with:

```bash
nats --server "$NATS_HOST" --creds ./nats-service.creds sub "messenger.>"
```

## SSM Session Manager (on-instance access)
All NATS instances have the `AmazonSSMManagedInstanceCore` IAM policy attached. To connect:

```bash
# Find the instance ID for a node
aws ec2 describe-instances \
--region us-east-2 \
--filters "Name=tag:Name,Values=${CLUSTER_NAME}-node-1" \
--query 'Reservations[0].Instances[0].InstanceId' \
--output text
# Start a session
aws ssm start-session --region us-east-2 --target <instance-id>
```

## Useful commands on the instance

```bash
# NATS server status
systemctl status nats
# Bootstrap log (full history from boot)
cat /var/log/nats-bootstrap.log
# NATS server logs (live)
journalctl -u nats -f
# Monitoring service logs (live)
journalctl -u nats-monitor -f
# Fluent Bit status
systemctl status fluent-bit
# Disk usage (NVMe mount for JetStream)
df -h /var/lib/nats
# Use NATS CLI with admin creds (context is pre-configured on node-1)
nats context select pivot
nats stream ls
nats stream info pivot_main
nats server report jetstream
nats server report connections
```

## Health checks

```bash
# From laptop (requires Cloudflare WARP)
curl http://${CLUSTER_NAME}-node-1.${CLUSTER_NAME}.internal:8222/healthz
# All monitoring endpoints
curl http://${CLUSTER_NAME}-node-1.${CLUSTER_NAME}.internal:8222/varz | jq .
curl http://${CLUSTER_NAME}-node-1.${CLUSTER_NAME}.internal:8222/jsz | jq .
curl http://${CLUSTER_NAME}-node-1.${CLUSTER_NAME}.internal:8222/connz | jq .
```

## View node logs
### Axiom (primary)
All NATS logs go to the nats-logs Axiom dataset: bootstrap, server, and
monitoring data from all 3 nodes. This is the primary place to look.

### CloudWatch (bootstrap-only fallback)
CloudWatch receives only bootstrap logs — the minimum needed to diagnose instances that fail before Axiom/Fluent Bit are operational.
- Log group: `/aws/ec2/nats-cluster/<cluster_name>/diagnostic`
- Stream prefix: `<cluster_name>-node-<n>-bootstrap-`
- Retention: 1 day (staging), 3 days (production)
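
To tail a node's bootstrap stream directly from the CLI, a sketch that assumes the `$AWS_REGION`/`$CLUSTER_NAME` exports from the connection section and the log group/stream naming above:

```bash
# Tail node-1's bootstrap logs from CloudWatch (fallback when Axiom has nothing).
aws logs tail "/aws/ec2/nats-cluster/${CLUSTER_NAME}/diagnostic" \
  --region "$AWS_REGION" \
  --log-stream-name-prefix "${CLUSTER_NAME}-node-1-bootstrap-" \
  --since 1h \
  --follow
```
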
CloudWatch Logs Insights query to check bootstrap health:

```
fields @timestamp, @message
| filter @message like /Axiom ingest check/
| sort @timestamp desc
| limit 200
```

## Monitoring and alerting
The monitoring script on each node polls /varz, /connz, /jsz, /healthz
and collects system-level metrics (CPU, memory, disk) every ~30 seconds. The
combined JSON is sent to Axiom via Fluent Bit.
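
The script itself ships on the instances; the sketch below only shows the general shape of one poll iteration, not the actual implementation. It assumes `curl` and `jq` on the node and emits just a subset of the `system.*` fields listed in the tables below.

```bash
#!/usr/bin/env bash
# Sketch of a single monitoring poll: NATS endpoints plus a few system metrics.
set -euo pipefail

varz=$(curl -sf http://localhost:8222/varz)
connz=$(curl -sf http://localhost:8222/connz)
jsz=$(curl -sf http://localhost:8222/jsz)
health=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8222/healthz)

disk_used_percent=$(df --output=pcent /var/lib/nats | tail -1 | tr -dc '0-9')
mem_used_percent=$(free | awk '/Mem:/ {printf "%.1f", $3/$2*100}')

# Combined JSON document; in production this is what Fluent Bit would ship to Axiom.
jq -n \
  --argjson varz "$varz" --argjson connz "$connz" --argjson jsz "$jsz" \
  --arg health "$health" --arg disk "$disk_used_percent" --arg mem "$mem_used_percent" \
  '{varz: $varz, connz: $connz, jsz: $jsz, health_status: $health,
    system: {disk_used_percent: ($disk | tonumber), mem_used_percent: ($mem | tonumber)}}'
```
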
### Key fields for Axiom alerts

#### NATS metrics
| Field | Description | Alert threshold |
|---|---|---|
| `jsz.storage` | JetStream bytes used | Alert when approaching `jsz.config.max_file_store` |
| `jsz.config.max_file_store` | Configured max storage | Reference value |
| `jsz.api.errors` | JetStream API error count | Alert on sustained non-zero |
| `connz.num_connections` | Active client connections | Alert on zero (no services connected) |
| `varz.slow_consumers` | Slow consumer count | Alert on non-zero |
| `health_status` | Contains the HTTP status code | Alert when not containing 200 |

#### System metrics (under `system.*`)
| Field | Description | Alert threshold |
|---|---|---|
| `system.cpu_percent` | Overall CPU utilization % | Alert >= 80% |
| `system.iowait_percent` | CPU time waiting on I/O % | Alert >= 20% |
| `system.steal_percent` | CPU steal time % (noisy neighbor) | Alert >= 10% |
| `system.mem_used_percent` | Memory utilization % | Alert >= 85% |
| `system.mem_total_bytes` | Total system memory | Reference |
| `system.mem_used_bytes` | Used memory | Reference |
| `system.mem_available_bytes` | Available memory | Reference |
| `system.disk_used_percent` | NVMe disk utilization % | Alert >= 80% |
| `system.disk_total_bytes` | NVMe total capacity | Reference |
| `system.disk_used_bytes` | NVMe used bytes | Reference |
| `system.disk_available_bytes` | NVMe available bytes | Reference |
| `system.load_1` | 1-minute load average | Reference |
| `system.load_5` | 5-minute load average | Reference |
| `system.load_15` | 15-minute load average | Reference |

## Rollouts
Instance refreshes are managed by Terraform and run sequentially with a 120-second inter-node delay for RAFT/JetStream sync:
```
node-1 refresh → wait for success → sleep 120s →
node-2 refresh → wait for success → sleep 120s →
node-3 refresh → wait for success
```

Each node has a 120-second `InstanceWarmup` before the ASG considers it healthy.

Pushing to `main` triggers a Terraform Cloud plan/apply, which handles the rollout automatically.
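
To watch a refresh from the CLI while Terraform Cloud drives it, a sketch that assumes the earlier `$AWS_REGION`/`$CLUSTER_NAME` exports and the `<cluster>-node-<n>-asg` ASG naming used in the next section:

```bash
# Status of the most recent instance refresh on node-1's ASG.
aws autoscaling describe-instance-refreshes \
  --region "$AWS_REGION" \
  --auto-scaling-group-name "${CLUSTER_NAME}-node-1-asg" \
  --query 'InstanceRefreshes[0].{Status:Status,Percent:PercentageComplete,Reason:StatusReason}'
```
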
## Manual single-node replacement
If you need to replace a single node outside of Terraform:

```bash
export AWS_REGION="us-east-2"
export CLUSTER_NAME="nats-staging"
export ASG_NAME="${CLUSTER_NAME}-node-1-asg"
INSTANCE_ID=$(aws autoscaling describe-auto-scaling-groups \
--region "$AWS_REGION" \
--auto-scaling-group-names "$ASG_NAME" \
--query 'AutoScalingGroups[0].Instances[0].InstanceId' \
--output text)
aws autoscaling terminate-instance-in-auto-scaling-group \
--region "$AWS_REGION" \
--instance-id "$INSTANCE_ID" \
--should-decrement-desired-capacity false
```

Wait for the replacement to come up and verify before touching the next node:

- Check CloudWatch for `Axiom ingest check` in the bootstrap stream.
- Check Axiom `nats-logs` for fresh events from that node.
- Run `nats server report jetstream` to confirm RAFT replicas are current.
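
One way to wait for the replacement to register before running the checks above (a sketch reusing `$ASG_NAME` and `$AWS_REGION` from the block above; re-run until LifecycleState is `InService` and HealthStatus is `Healthy`):

```bash
# Show the lifecycle and health state of the node's current instance.
aws autoscaling describe-auto-scaling-groups \
  --region "$AWS_REGION" \
  --auto-scaling-group-names "$ASG_NAME" \
  --query 'AutoScalingGroups[0].Instances[0].{Id:InstanceId,Lifecycle:LifecycleState,Health:HealthStatus}'
```
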
## Switching a service to self-hosted NATS
Each service’s `nats_ssm_parameter_arn_base` in `services.tf` controls which
NATS cluster it connects to. To switch a service:
1. In `apps/terraform-aws-backend-staging/src/services.tf`, change:

```hcl
# Old (Synadia Cloud)
nats_ssm_parameter_arn_base = "...parameter${module.nats.ssm_prefix}"

# New (self-hosted)
nats_ssm_parameter_arn_base = "...parameter${module.nats_cluster.service_ssm_prefixes["<service_name>"]}"
```

2. The IAM permission for `/nats-cluster/*` SSM reads is already added to all service modules; no additional IAM changes are needed.

3. Push to `main`. Terraform will update the ECS task definition and deploy.

The SSM parameter names are identical between old and new (`nats_host`,
`nats_admin_jwt`, `nats_admin_nkey_seed`), so no application code changes are
needed.
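
Before flipping a service, it can be worth confirming the self-hosted cluster actually exposes its parameters. A sketch; it reuses `$SSM_PREFIX` from the connection section, and `<service_name>` is a placeholder as above:

```bash
# List the service-scoped parameter names on the self-hosted cluster.
aws ssm get-parameters-by-path \
  --region "$AWS_REGION" \
  --path "$SSM_PREFIX/services/<service_name>" \
  --query 'Parameters[].Name' \
  --output text
```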