Platform Maintenance Checklist

This checklist applies to all production environments. As we add more private deployments, we will want to automate some of the steps below, or at least surface them on a dashboard.

Weekly

AWS

  • Verify that replication from the source buckets to the disaster recovery buckets (database backups and user content) is up to date
  • Request production account limit increases, if needed:
    • SES - per-day and per-second sending limits (if we fail to increase these in time, email sign-in breaks)
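
As a rough sketch of how the SES limit check could be automated, the helper below summarizes a quota response shaped like the one returned by the SES GetSendQuota API (keys `Max24HourSend`, `SentLast24Hours`, `MaxSendRate`). The function names, the report type, and the 80% threshold are assumptions made for illustration, not existing tooling. GetSendQuota reports only the maximum per-second rate, not the current one, so the per-second limit still needs a manual look.

```python
from dataclasses import dataclass


@dataclass
class SesQuotaReport:
    daily_utilization: float  # fraction of the 24-hour quota already used
    needs_increase: bool      # True when utilization reaches the threshold
    max_send_rate: float      # per-second limit, reported for reference only


def evaluate_ses_quota(quota: dict, threshold: float = 0.8) -> SesQuotaReport:
    """Summarize a GetSendQuota-style response.

    `threshold` is a hypothetical alert level chosen for this sketch.
    """
    max_daily = float(quota["Max24HourSend"])
    sent = float(quota["SentLast24Hours"])
    # SES reports -1 for accounts with an unlimited daily quota.
    utilization = 0.0 if max_daily <= 0 else sent / max_daily
    return SesQuotaReport(
        daily_utilization=utilization,
        needs_increase=utilization >= threshold,
        max_send_rate=float(quota["MaxSendRate"]),
    )


def fetch_and_evaluate(threshold: float = 0.8) -> SesQuotaReport:
    """Fetch the live quota. Assumes boto3 is installed and AWS credentials are configured."""
    import boto3  # imported lazily so the pure helper above has no dependencies

    return evaluate_ses_quota(boto3.client("ses").get_send_quota(), threshold)
```

If `needs_increase` comes back true, that would be the cue to file the limit increase request before sign-in emails start failing.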

AWS Aurora PostgreSQL

  • Verify that Aurora PostgreSQL backups are up to date in the AWS Console
  • Review Aurora cluster metrics and adjust serverless capacity if needed
  • Monitor global database replication status
  • Review Performance Insights and query performance recommendations

Turbopuffer

  • Review Turbopuffer usage metrics (requests, storage) and ensure billing is as expected.
  • Check for any service announcements or required updates from Turbopuffer.

Axiom

  • Review application logs and traces; verify that all are being ingested successfully
  • Review AWS service logs and metrics; verify that all are being ingested successfully

Sentry

  • Resolve issues and add new issues to Linear

Rootly

  • Verify on-call schedule is up to date
  • Verify incidents are being properly managed and closed
  • Ensure status page reflects current service health

Monthly

AWS

ECS

  • Check Compute Savings Plan usage in the organization AWS account and adjust if needed.
  • Verify service quotas are sufficient to support scaling Fargate workloads
  • Upgrade Flipt, OpenTelemetry, and Fluent Bit versions in ECS task definitions
  • Adjust ECS task definition memory and CPU reservations for each service as needed to match observed usage.

Aurora PostgreSQL

  • Verify automatic minor version upgrades are enabled and on track
  • Plan and execute major version upgrades (after testing)
  • Monitor serverless scaling patterns and adjust min/max ACU as needed
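
One possible way to make the scaling review less subjective is to measure how often the cluster sits at its configured floor or ceiling. The sketch below assumes the capacity samples were exported from CloudWatch (for example, the `ServerlessDatabaseCapacity` metric); the function name and the 5% tolerance are hypothetical choices for illustration.

```python
def summarize_acu_usage(samples, min_acu, max_acu, tolerance=0.05):
    """Return the fraction of samples near the configured min and max ACU.

    A sample counts as "at" a bound when it lies within `tolerance` of the
    min-to-max span from that bound.
    """
    if not samples:
        raise ValueError("no capacity samples provided")
    span = max_acu - min_acu
    at_max = sum(1 for s in samples if s >= max_acu - tolerance * span)
    at_min = sum(1 for s in samples if s <= min_acu + tolerance * span)
    total = len(samples)
    return {
        "at_min": at_min / total,  # high value: the floor may be higher than needed
        "at_max": at_max / total,  # high value: the ceiling may be throttling the cluster
        "peak": max(samples),
    }
```

A cluster that spends a large share of the month at the ceiling is a candidate for a higher max ACU, while one that rarely leaves the floor may tolerate a lower min. Where to draw those lines is a judgment call for the team.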

Axiom

  • Review LiveKit metrics and pass a list of high-usage organizations and users to Customer Success for possible abuse investigation.

Temporal Cloud

  • Check the Temporal Cloud certificate expiration date and update the Cloud and client certs if needed (this might happen automatically through Terraform runs)
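
The expiration check could be scripted along these lines. This sketch assumes the certificate's end date has already been read with `openssl x509 -enddate -noout -in <cert file>`, which prints a line of the form `notAfter=Jun  1 12:00:00 2030 GMT`. The function names are hypothetical, and the certificate path is deliberately left out.

```python
from datetime import datetime, timezone


def parse_openssl_enddate(line: str) -> datetime:
    """Parse an openssl 'notAfter=...' line into a timezone-aware UTC datetime."""
    value = line.strip().split("=", 1)[1]
    # openssl prints the end date in GMT; drop the suffix and attach UTC explicitly.
    if value.endswith(" GMT"):
        value = value[: -len(" GMT")]
    parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y")
    return parsed.replace(tzinfo=timezone.utc)


def days_until_expiry(not_after: datetime, now: datetime) -> int:
    """Whole days remaining before the certificate expires (negative once expired)."""
    return (not_after - now).days
```

In practice `now` would be `datetime.now(timezone.utc)`, and the result could be compared against whatever renewal lead time the team prefers.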

DNS

  • Verify that the Cloudflare IP ranges have not changed. If an IP range was added or removed, update Terraform for both the frontend WAF and the backend ALB security group. (We use Cloudflare as the exclusive source of frontend CloudFront and backend ALB traffic.)
  • Check that exp.pivot.app is working (PostHog proxy)
  • Check that quality.pivot.app is working (Sentry proxy)
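
The Cloudflare IP range comparison is a good candidate for automation. The sketch below diffs the published ranges against the ranges currently configured in Terraform. It assumes both are available as newline-separated CIDR text: Cloudflare publishes such lists at https://www.cloudflare.com/ips-v4 and https://www.cloudflare.com/ips-v6, while how the Terraform side gets exported is left open. The function names are hypothetical.

```python
import ipaddress


def parse_cidrs(text):
    """Parse newline-separated CIDR blocks into a set of networks, skipping blank lines."""
    return {
        ipaddress.ip_network(line.strip())
        for line in text.splitlines()
        if line.strip()
    }


def diff_ranges(published, configured):
    """Return (to_add, to_remove) as sorted lists of CIDR strings.

    to_add:    ranges Cloudflare publishes that are missing from our config
    to_remove: ranges in our config that Cloudflare no longer publishes
    """
    to_add = sorted(str(net) for net in published - configured)
    to_remove = sorted(str(net) for net in configured - published)
    return to_add, to_remove
```

Two empty lists mean no Terraform change is needed. Anything in `to_add` or `to_remove` would need to be applied to both the frontend WAF and the backend ALB security group.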