Elastic Container Service (ECS)

ECS is how the Pivot backend is deployed. This includes all internal services and all API services.

ECS takes ongoing tuning, scaling, upgrades, and maintenance. It is the core of our maintenance checklist.

Fargate

We deploy tasks on Fargate, even though it costs more than EC2, for the following reasons:

  1. Reliability – The Fargate service is responsible for the EC2 instances and for cross-AZ availability.
  2. Ease of scale – Fargate tasks launch fast, whereas launching a task on EC2 depends on whether existing capacity is available.
  3. Networking scale – with EC2, we are stuck with a very limited number of network interfaces allowed per EC2 instance.
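The networking limit in point 3 can be made concrete with a little arithmetic. The sketch below assumes tasks run in `awsvpc` network mode, where each task takes its own network interface, and that ENI trunking is not enabled. The function names and the ENI limit of 3 are illustrative only and do not describe any particular instance type we run.

```python
import math


def max_tasks_per_instance(eni_limit: int) -> int:
    """Tasks one EC2 instance can host when every task needs its own ENI."""
    if eni_limit < 1:
        raise ValueError("an instance always has at least one network interface")
    # The instance keeps one interface for itself (its primary ENI).
    return eni_limit - 1


def instances_needed(task_count: int, eni_limit: int) -> int:
    """EC2 instances required to place task_count tasks under the ENI limit."""
    per_instance = max_tasks_per_instance(eni_limit)
    if per_instance == 0:
        raise ValueError("this ENI limit leaves no interface for tasks")
    return math.ceil(task_count / per_instance)


# Illustrative numbers: 10 tasks on instances that allow 3 ENIs each.
print(instances_needed(task_count=10, eni_limit=3))
```

Under these assumptions the ENI limit, rather than CPU or memory, can end up deciding how many instances are needed, which is the constraint Fargate removes for us.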

The downside of Fargate is that we have to run an OpenTelemetry collector alongside every service that emits traces, rather than one per EC2 instance.

Volumes

NATS is deployed with an ephemeral volume. Due to ECS restrictions, volumes are not durable beyond the life of a task: they die when the task dies.

Backups are not taken for NATS, as it has no built-in S3 backup provider and is not designed for long-term storage, only queue storage, ideally for seconds at a time. We rely on NATS replication between instances plus Benthos archival of JetStream messages to S3.
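As a rough illustration of what an ephemeral volume looks like in a task definition, here is a minimal sketch written as a Python dictionary in the shape ECS expects. The family name, volume name, image tag, container path, and storage size are all hypothetical and are not taken from our real NATS task definition. A volume declared with only a name has no durable backing store, so its contents last only as long as the task.

```python
# Hypothetical fragment of a NATS task definition; all names and sizes are assumptions.
nats_task_definition = {
    "family": "nats",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    # Task-level ephemeral storage on Fargate (accepted range is 21-200 GiB).
    "ephemeralStorage": {"sizeInGiB": 50},
    # A volume with only a name is backed by task storage and is lost with the task.
    "volumes": [{"name": "nats-data"}],
    "containerDefinitions": [
        {
            "name": "nats",
            "image": "nats:latest",
            "mountPoints": [
                {
                    "sourceVolume": "nats-data",
                    "containerPath": "/data",
                    "readOnly": False,
                }
            ],
        }
    ],
}


def undeclared_mounts(task_def: dict) -> list:
    """Return mount points that reference a volume the task does not declare."""
    declared = {volume["name"] for volume in task_def.get("volumes", [])}
    missing = []
    for container in task_def["containerDefinitions"]:
        for mount in container.get("mountPoints", []):
            if mount["sourceVolume"] not in declared:
                missing.append((container["name"], mount["sourceVolume"]))
    return missing
```

The helper is only a sanity check of the kind one might run before registering a definition; it is not part of any AWS tooling.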

Service Mesh

ECS Service Connect is enabled for all tasks except NATS, Pilot, and Dealer. We use security group ingress/egress rules to control service-level communication permissions, as Service Connect does not provide this. Incoming network traffic to services proxied with ECS Service Connect is default-denied at the security group level. We then allow traffic from and to specific security groups for each service.

Only those services that have a legitimate use case have the ability to reach the public internet (i.e., those services that connect to third-party APIs). This is all configured with the task definition security group for each service.
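The default-deny model described above can be sketched as a small in-memory model. This is not how the rules are actually provisioned; it only shows the logic, namely that a connection works when the source allows egress to the target and the target allows ingress from the source. The service names `api`, `billing`, and `reports` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceSecurityGroup:
    """Simplified stand-in for one service's security group."""
    name: str
    ingress_from: set = field(default_factory=set)  # groups allowed to call us
    egress_to: set = field(default_factory=set)     # groups we are allowed to call
    internet_egress: bool = False                   # only for third-party API callers


def can_connect(source: ServiceSecurityGroup, target: ServiceSecurityGroup) -> bool:
    """Default deny: both the egress rule and the ingress rule must exist."""
    return target.name in source.egress_to and source.name in target.ingress_from


# Hypothetical services: api may call billing; billing talks to a third-party API.
api = ServiceSecurityGroup("api", egress_to={"billing"})
billing = ServiceSecurityGroup("billing", ingress_from={"api"}, internet_egress=True)
reports = ServiceSecurityGroup("reports")

print(can_connect(api, billing))      # allowed in both directions of the rule pair
print(can_connect(reports, billing))  # no rules, so denied by default
```

Modelling both sides of the rule makes it clear why adding a new service means touching two security groups: the caller's egress and the callee's ingress.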

Service Connect is not enabled on NATS tasks. NATS has its own ways of exposing traffic metrics and is not intended to run behind Envoy proxies, which is what Service Connect uses behind the scenes.

Service Connect is not enabled for Dealer, which uses Service Discovery to register and deregister. Service Connect is also not enabled for Pilot, which uses Service Discovery (Cloud Map) to connect to Dealer. Pilot instances are registered with the ALB, which does not require Pilot to have either a Service Connect proxy or a Service Discovery registration.

Observability

The OpenTelemetry collector running as a sidecar receives metrics and traces. We push to Axiom, a third-party platform, and try to avoid using CloudWatch.

Logs are pushed to Axiom via Fluent Bit, which is run for each container using AWS FireLens.
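To show how the sidecars fit together, here is a minimal sketch of a task definition, again written as a Python dictionary. It assumes one application container, one OpenTelemetry collector sidecar, and one Fluent Bit log router declared through FireLens. The family name, container names, and image tags are hypothetical, and the Fluent Bit output options are deliberately reduced to the output plugin name; the real Axiom host, dataset, and credentials are omitted rather than guessed.

```python
# Hypothetical task definition; names, images, and options are assumptions.
FIRELENS_LOGGING = {
    "logDriver": "awsfirelens",
    # Fluent Bit output settings; destination details are intentionally left out.
    "options": {"Name": "http"},
}

service_task_definition = {
    "family": "example-service",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "containerDefinitions": [
        {
            "name": "app",
            "image": "example/app:latest",
            "essential": True,
            "logConfiguration": FIRELENS_LOGGING,
        },
        {
            # Sidecar that receives metrics and traces from the app container.
            "name": "otel-collector",
            "image": "otel/opentelemetry-collector-contrib:latest",
            "essential": False,
            "logConfiguration": FIRELENS_LOGGING,
        },
        {
            # FireLens log router; this is the container that runs Fluent Bit.
            "name": "log-router",
            "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
            "essential": True,
            "firelensConfiguration": {"type": "fluentbit"},
        },
    ],
}


def containers_not_routed(task_def: dict) -> list:
    """Names of containers whose logs do not go through the FireLens router."""
    unrouted = []
    for container in task_def["containerDefinitions"]:
        if "firelensConfiguration" in container:
            continue  # the router itself does not route through FireLens
        driver = container.get("logConfiguration", {}).get("logDriver")
        if driver != "awsfirelens":
            unrouted.append(container["name"])
    return unrouted
```

A check like `containers_not_routed` is one way to catch a container that would silently fall back to another log driver, though it is only an illustration and not part of our tooling.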