
Self-hosted NATS (JetStream) runbook

This runbook covers the self-hosted NATS JetStream cluster running on EC2 (3-node, Graviton, NVMe-backed). All services connect via Route53 private DNS.

Architecture

  • Instances: 3x m8gd.medium (Graviton4, 1 vCPU, 4 GB RAM, NVMe local storage)
  • Region: us-east-2 (one node per AZ)
  • Auth: Operator mode with MEMORY resolver (account JWTs preloaded at boot); see the config sketch after this list
  • Stream: pivot_main — R3, interest retention, all service subjects
  • Networking: Private subnets, NAT gateway for outbound, no public IPs
  • Logging: Axiom (primary), CloudWatch (bootstrap-only fallback)
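
A minimal sketch of the per-node server config this architecture implies is below. The operator JWT path, resolver_preload contents, and the staging hostnames are illustrative assumptions; the authoritative config lives on the instances.

# nats-server.conf sketch (illustrative, not the deployed file)
server_name: nats-staging-node-1
listen: 0.0.0.0:4222
http: 0.0.0.0:8222                  # /healthz, /varz, /jsz, /connz

jetstream {
  store_dir: /var/lib/nats          # NVMe mount
}

cluster {
  name: nats-staging
  listen: 0.0.0.0:6222
  routes: [
    nats://nats-staging-node-1.nats-staging.internal:6222
    nats://nats-staging-node-2.nats-staging.internal:6222
    nats://nats-staging-node-3.nats-staging.internal:6222
  ]
}

# Operator mode with the MEMORY resolver: account JWTs are preloaded at boot
operator: /etc/nats/operator.jwt
resolver: MEMORY
resolver_preload: {
  # <account public key>: <account JWT>
}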

DNS names

| Node   | DNS                                 | Ports                                         |
|--------|-------------------------------------|-----------------------------------------------|
| node-1 | <cluster>-node-1.<cluster>.internal | 4222 (client), 6222 (cluster), 8222 (monitor) |
| node-2 | <cluster>-node-2.<cluster>.internal | same                                          |
| node-3 | <cluster>-node-3.<cluster>.internal | same                                          |

Prerequisites

  • Cloudflare WARP connected. See Cloudflare Tunnels.

  • AWS CLI configured with access to the target account.

  • NATS CLI installed:

    brew install nats-io/nats-tools/nats

Connect with admin credentials (from your laptop)

  1. Export the cluster name and region:

    export AWS_REGION="us-east-2"
    export CLUSTER_NAME="nats-staging"   # or nats-prod
    export SERVICE_NAME="facebox"        # any service, just for the host list
    export SSM_PREFIX="/nats-cluster/${CLUSTER_NAME}"
  2. Fetch the connection string and admin creds from SSM:

    NATS_HOST=$(aws ssm get-parameter \
      --region "$AWS_REGION" \
      --name "$SSM_PREFIX/services/$SERVICE_NAME/nats_host" \
      --query 'Parameter.Value' \
      --output text)

    ADMIN_JWT=$(aws ssm get-parameter \
      --region "$AWS_REGION" \
      --name "$SSM_PREFIX/admin/nats_admin_jwt" \
      --with-decryption \
      --query 'Parameter.Value' \
      --output text)

    ADMIN_NKEY=$(aws ssm get-parameter \
      --region "$AWS_REGION" \
      --name "$SSM_PREFIX/admin/nats_admin_nkey_seed" \
      --with-decryption \
      --query 'Parameter.Value' \
      --output text)
  3. Create a local creds file:

    cat > ./nats-admin.creds <<EOF
-----BEGIN NATS USER JWT-----
$ADMIN_JWT
------END NATS USER JWT------

************************* IMPORTANT *************************
NKEY Seed printed below can be used to sign and prove identity.
NKEYs are sensitive and should be treated as secrets.

-----BEGIN USER NKEY SEED-----
$ADMIN_NKEY
------END USER NKEY SEED------
*************************************************************
EOF
    chmod 600 ./nats-admin.creds
  4. Use the NATS CLI:

    nats --server "$NATS_HOST" --creds ./nats-admin.creds stream ls
    nats --server "$NATS_HOST" --creds ./nats-admin.creds stream info pivot_main
    nats --server "$NATS_HOST" --creds ./nats-admin.creds server report jetstream
  5. Clean up when done:

    rm -f ./nats-admin.creds

Connect with service credentials

Replace the admin SSM paths with service-specific ones:

SERVICE_NAME="messenger" SERVICE_JWT=$(aws ssm get-parameter \ --region "$AWS_REGION" \ --name "$SSM_PREFIX/services/$SERVICE_NAME/nats_admin_jwt" \ --with-decryption \ --query 'Parameter.Value' \ --output text) SERVICE_NKEY=$(aws ssm get-parameter \ --region "$AWS_REGION" \ --name "$SSM_PREFIX/services/$SERVICE_NAME/nats_admin_nkey_seed" \ --with-decryption \ --query 'Parameter.Value' \ --output text)

Build a creds file using the same format as above and connect with:

nats --server "$NATS_HOST" --creds ./nats-service.creds sub "messenger.>"

SSM Session Manager (on-instance access)

All NATS instances have AmazonSSMManagedInstanceCore attached. To connect:

# Find the instance ID for a node
aws ec2 describe-instances \
  --region us-east-2 \
  --filters "Name=tag:Name,Values=${CLUSTER_NAME}-node-1" \
  --query 'Reservations[0].Instances[0].InstanceId' \
  --output text

# Start a session
aws ssm start-session --region us-east-2 --target <instance-id>

Useful commands on the instance

# NATS server status
systemctl status nats

# Bootstrap log (full history from boot)
cat /var/log/nats-bootstrap.log

# NATS server logs (live)
journalctl -u nats -f

# Monitoring service logs (live)
journalctl -u nats-monitor -f

# Fluent Bit status
systemctl status fluent-bit

# Disk usage (NVMe mount for JetStream)
df -h /var/lib/nats

# Use NATS CLI with admin creds (context is pre-configured on node-1)
nats context select pivot
nats stream ls
nats stream info pivot_main
nats server report jetstream
nats server report connections

Health checks

# From laptop (requires Cloudflare WARP)
curl http://${CLUSTER_NAME}-node-1.${CLUSTER_NAME}.internal:8222/healthz

# All monitoring endpoints
curl http://${CLUSTER_NAME}-node-1.${CLUSTER_NAME}.internal:8222/varz | jq .
curl http://${CLUSTER_NAME}-node-1.${CLUSTER_NAME}.internal:8222/jsz | jq .
curl http://${CLUSTER_NAME}-node-1.${CLUSTER_NAME}.internal:8222/connz | jq .

View node logs

Axiom (primary)

All NATS logs go to the nats-logs Axiom dataset: bootstrap, server, and monitoring data from all 3 nodes. This is the primary place to look.
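
If you need a starting query, something along these lines pulls recent events for a single node; the exact field that carries the hostname depends on how Fluent Bit tags records, so treat host below as an assumption:

['nats-logs']
| where host == "nats-staging-node-1"
| sort by _time desc
| limit 100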

CloudWatch (bootstrap-only fallback)

CloudWatch receives only bootstrap logs — the minimum needed to diagnose instances that fail before Axiom/Fluent Bit are operational.

  • Log group: /aws/ec2/nats-cluster/<cluster_name>/diagnostic
  • Stream prefix: <cluster_name>-node-<n>-bootstrap-
  • Retention: 1 day (staging), 3 days (production)
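
To tail the bootstrap stream from your laptop without opening the console (log group and stream prefix as listed above; requires AWS CLI v2):

aws logs tail "/aws/ec2/nats-cluster/${CLUSTER_NAME}/diagnostic" \
  --region "$AWS_REGION" \
  --log-stream-name-prefix "${CLUSTER_NAME}-node-1-bootstrap-" \
  --since 1h \
  --follow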

CloudWatch Logs Insights query to check bootstrap health:

fields @timestamp, @message
| filter @message like /Axiom ingest check/
| sort @timestamp desc
| limit 200

Monitoring and alerting

The monitoring script on each node polls /varz, /connz, /jsz, /healthz and collects system-level metrics (CPU, memory, disk) every ~30 seconds. The combined JSON is sent to Axiom via Fluent Bit.
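
A rough sketch of what that loop looks like is below; the real script and its output path may differ (/var/log/nats-monitor.jsonl and the exact record shape are assumptions):

# Sketch only: mirrors the shape of the monitor loop, not its exact implementation
while true; do
  varz=$(curl -sf http://localhost:8222/varz)
  jsz=$(curl -sf http://localhost:8222/jsz)
  connz=$(curl -sf http://localhost:8222/connz)
  health=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8222/healthz)
  load1=$(cut -d' ' -f1 /proc/loadavg)

  # One combined JSON record per interval; Fluent Bit tails the file and ships it to Axiom
  jq -cn \
    --argjson varz "$varz" --argjson jsz "$jsz" --argjson connz "$connz" \
    --arg health_status "$health" --arg load_1 "$load1" \
    '{varz: $varz, jsz: $jsz, connz: $connz,
      health_status: $health_status,
      system: {load_1: ($load_1 | tonumber)}}' \
    >> /var/log/nats-monitor.jsonl

  sleep 30
done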

Key fields for Axiom alerts

NATS metrics

| Field                     | Description               | Alert threshold                                  |
|---------------------------|---------------------------|--------------------------------------------------|
| jsz.storage               | JetStream bytes used      | Alert when approaching jsz.config.max_file_store |
| jsz.config.max_file_store | Configured max storage    | Reference value                                  |
| jsz.api.errors            | JetStream API error count | Alert on sustained non-zero                      |
| connz.num_connections     | Active client connections | Alert on zero (no services connected)            |
| varz.slow_consumers       | Slow consumer count       | Alert on non-zero                                |
| health_status             | Contains HTTP status code | Alert when not containing 200                    |
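
Several of these values can be spot-checked by hand on a node against the same endpoints the monitor polls:

# Live values behind the alert fields above (names per the nats-server monitoring endpoints)
curl -s http://localhost:8222/varz  | jq '{slow_consumers: .slow_consumers}'
curl -s http://localhost:8222/connz | jq '{num_connections: .num_connections}'
curl -s http://localhost:8222/jsz   | jq '{storage_bytes_used: .storage, api_errors: .api.errors}'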

System metrics (under system.*)

| Field                       | Description                       | Alert threshold |
|-----------------------------|-----------------------------------|-----------------|
| system.cpu_percent          | Overall CPU utilization %         | Alert >= 80%    |
| system.iowait_percent       | CPU time waiting on I/O %         | Alert >= 20%    |
| system.steal_percent        | CPU steal time % (noisy neighbor) | Alert >= 10%    |
| system.mem_used_percent     | Memory utilization %              | Alert >= 85%    |
| system.mem_total_bytes      | Total system memory               | Reference       |
| system.mem_used_bytes       | Used memory                       | Reference       |
| system.mem_available_bytes  | Available memory                  | Reference       |
| system.disk_used_percent    | NVMe disk utilization %           | Alert >= 80%    |
| system.disk_total_bytes     | NVMe total capacity               | Reference       |
| system.disk_used_bytes      | NVMe used bytes                   | Reference       |
| system.disk_available_bytes | NVMe available bytes              | Reference       |
| system.load_1               | 1-min load average                | Reference       |
| system.load_5               | 5-min load average                | Reference       |
| system.load_15              | 15-min load average               | Reference       |
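
When validating a system-level alert, the same numbers can be pulled by hand on the node:

# Point-in-time equivalents of the system.* fields above
vmstat 1 2            # CPU columns (us/sy/id/wa/st): cpu_percent, iowait_percent, steal_percent
free -b               # mem_total_bytes, mem_used_bytes, mem_available_bytes
df -B1 /var/lib/nats  # disk_total_bytes, disk_used_bytes, disk_available_bytes
cat /proc/loadavg     # load_1, load_5, load_15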

Rollouts

Instance refreshes are managed by Terraform and run sequentially with a 120-second inter-node delay for RAFT/JetStream sync:

node-1 refresh → wait for success → sleep 120s → node-2 refresh → wait for success → sleep 120s → node-3 refresh → wait for success

Each node has a 120-second InstanceWarmup before ASG considers it healthy. Pushing to main triggers Terraform Cloud plan/apply which handles everything automatically.
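
To watch a refresh that Terraform has kicked off (ASG naming follows the pattern used in the manual replacement section below):

aws autoscaling describe-instance-refreshes \
  --region "$AWS_REGION" \
  --auto-scaling-group-name "${CLUSTER_NAME}-node-1-asg" \
  --query 'InstanceRefreshes[0].{Status:Status,Pct:PercentageComplete,Reason:StatusReason}' \
  --output table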

Manual single-node replacement

If you need to replace a single node outside of Terraform:

export AWS_REGION="us-east-2"
export CLUSTER_NAME="nats-staging"
export ASG_NAME="${CLUSTER_NAME}-node-1-asg"

INSTANCE_ID=$(aws autoscaling describe-auto-scaling-groups \
  --region "$AWS_REGION" \
  --auto-scaling-group-names "$ASG_NAME" \
  --query 'AutoScalingGroups[0].Instances[0].InstanceId' \
  --output text)

# Keep desired capacity so the ASG launches a replacement node
aws autoscaling terminate-instance-in-auto-scaling-group \
  --region "$AWS_REGION" \
  --instance-id "$INSTANCE_ID" \
  --no-should-decrement-desired-capacity

Wait for the replacement to come up and verify before touching the next node:

  1. Check CloudWatch for Axiom ingest check in the bootstrap stream.
  2. Check Axiom nats-logs for fresh events from that node.
  3. Run nats server report jetstream to confirm RAFT replicas are current (see the check below).
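
For step 3, the per-replica state of the main stream can also be inspected directly; every listed replica should report current: true once the new node has caught up:

nats --server "$NATS_HOST" --creds ./nats-admin.creds \
  stream info pivot_main --json | jq '.cluster.replicas'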

Switching a service to self-hosted NATS

Each service’s nats_ssm_parameter_arn_base in services.tf controls which NATS cluster it connects to. To switch a service:

  1. In apps/terraform-aws-backend-staging/src/services.tf, change:

    # Old (Synadia Cloud)
    nats_ssm_parameter_arn_base = "...parameter${module.nats.ssm_prefix}"

    # New (self-hosted)
    nats_ssm_parameter_arn_base = "...parameter${module.nats_cluster.service_ssm_prefixes["<service_name>"]}"
  2. The IAM permission for /nats-cluster/* SSM reads is already added to all service modules — no additional IAM changes needed.

  3. Push to main. Terraform will update the ECS task definition and deploy.

The SSM parameter names are identical between old and new (nats_host, nats_admin_jwt, nats_admin_nkey_seed), so no application code changes are needed.
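
After the switch, a quick way to confirm the parameters resolve under the self-hosted prefix (with SERVICE_NAME set to the service you just switched and the other variables exported earlier in this runbook):

for p in nats_host nats_admin_jwt nats_admin_nkey_seed; do
  aws ssm get-parameter \
    --region "$AWS_REGION" \
    --name "/nats-cluster/${CLUSTER_NAME}/services/${SERVICE_NAME}/$p" \
    --query 'Parameter.Name' \
    --output text
done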
