Infrastructure Provisioning

Terraform IaC

Pivot uses Terraform for all cloud infra that supports it, via the pivot-internal GitHub repository's connection to multiple Terraform Cloud workspaces.

The Terraform GitHub integration allows us to plan on PRs and then apply to the staging environment on pushes to the main branch.

Avoid using the Terraform CLI locally, even for non-production environments, as Terraform Cloud provides a single source of truth for our Terraform apply history.

Services Managed with Terraform

AWS

Each Pivot deployment is its own AWS account. We use Terraform for the creation of each account, but not for general billing/IAM configuration of the management account of the AWS organization itself.
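
A minimal sketch of what the per-deployment account resource can look like, using the standard AWS Organizations resource; the account name, email, and role below are hypothetical:

```hcl
# Sketch: one AWS member account per Pivot deployment, created from
# the management account. All values below are illustrative.
resource "aws_organizations_account" "tenant" {
  name      = "pivot-tenant-example"              # hypothetical account name
  email     = "aws+tenant-example@pivot.example"  # hypothetical root email
  role_name = "OrganizationAccountAccessRole"     # default cross-account role
  # Billing/IAM for the management account itself is not managed here.
}
```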

Cloudflare

  • DNS: The zone representing each domain was created manually, as were some DNS records, so Terraform is only used to manage specific records (see the sketch after this list).
  • WAF: TBD
  • Load Balancer: Cloudflare Load Balancer is used for the multi-tenant backend endpoints to support future multi-region deployments. This is managed via Terraform.
  • R2:
  • Access: Cloudflare Access rules are configured manually.
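
As an illustration of the DNS point above, a Terraform-managed record in a manually created zone might look like the following; the record name and target are hypothetical:

```hcl
# Sketch: a single Terraform-managed record in a zone that Terraform
# does not own. Names and values are illustrative.
resource "cloudflare_record" "api" {
  zone_id = var.cloudflare_zone_id    # zone created manually, ID passed in
  name    = "api"
  type    = "CNAME"
  content = "backend.pivot.example"   # called "value" in older provider versions
  proxied = true
}
```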

Rootly

Rootly is used for incident management and our status page. Service disruptions detected through our monitoring systems are managed through the Rootly platform.

Terraform Environment and Application CI/CD

We use Terraform to deploy infrastructure into each AWS environment as well as to deploy each new image version for ECS services. There is no technical distinction made here; image version increments are performed with terraform apply the same way infrastructure changes are (a sketch of how the image tag lives in Terraform follows the list below). Therefore, our CI/CD configuration needs to accomplish the following:

  1. Terraform configuration needs to be managed distinctly for our staging environment, multi-tenant production environment, and our single-tenant environment template. At the same time, our Terraform configuration should be as DRY as possible.
  2. We need to deploy to staging first, and only then deploy to the multiple production environments, after an arbitrary window that leaves time for E2E testing against staging.
  3. Staging deployment needs to run whenever new image versions are pushed, as well as whenever it is otherwise desired. Production deployments need to follow successful staging deployments without manual intervention for image version bumps, but with flexibility for testing against the staging backend and for manual approval when more than just an image version was changed.
  4. Changes to a backend service's source code (or to the code of a library the service depends on) in the pivot repo need to trigger a new Docker image to be built and pushed to ECR and then the Terraform file that defines the service's task definition needs to be updated with the new image tag (thereby triggering staging deployment via Terraform).
  5. Each service should be deployed independently. A change to one should not trigger redeployment of all.
  6. CD workflow concurrency needs to be managed. When it comes to building and pushing images and updating .tf files, a new run should not cancel an old run, because cancelling in-progress runs creates a lot of idempotence and race condition complexity to consider (a concurrency sketch follows below).
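
To make requirement 4 concrete, here is a minimal sketch of where the image tag lives in a service's Terraform config; the local name, task family, and sizes are hypothetical, not the real configuration:

```hcl
# Sketch: the tag CI rewrites on each image push. All values below
# are illustrative.
locals {
  image_tag = "3f2a9c1"   # rewritten by CI with the new Git SHA
}

resource "aws_ecs_task_definition" "service" {
  family                   = "pivot-service-example"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512
  container_definitions = jsonencode([{
    name      = "app"
    image     = "${var.ecr_repo_url}:${local.image_tag}"
    essential = true
  }])
}
```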

To accomplish the above, we use three GitHub Actions workflows along with Terraform Cloud.
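
One piece shared by all three workflows is the concurrency handling from requirement 6. A minimal sketch of the queue-don't-cancel setting each workflow can carry, with an illustrative group name:

```yaml
# Sketch: queue new runs behind in-progress ones instead of cancelling,
# so image pushes and .tf edits are never interrupted mid-flight.
concurrency:
  group: deploy-${{ github.workflow }}   # illustrative group name
  cancel-in-progress: false              # never cancel an old run
```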

Workflow #1: deploy-docker-services.yml

The pivot repo has a single workflow that can generically, using a matrix strategy, determine which applications are affected by a given commit to main, build a Docker image for each, and push it to our ECR repository. It uses the service name to determine where to find the Dockerfile to build, and it uses the Git SHA that triggered it to name the image: the service name is the ECR repository name and the Git SHA is the tag. Once the image is pushed successfully, workflow #2 is triggered, passing in the Git SHA and service name.
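
A condensed sketch of the workflow's shape, assuming hypothetical service names, paths, region, secrets, and GitHub org (pivot-org); the affected-service detection is elided:

```yaml
# Sketch of deploy-docker-services.yml. Real change detection (which
# services a given commit affects) is elided; the matrix here is static.
name: deploy-docker-services
on:
  push:
    branches: [main]
permissions:
  id-token: write   # OIDC role assumption for ECR access
  contents: read
jobs:
  build-push:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        service: [api, worker]   # hypothetical service names
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.ECR_PUSH_ROLE_ARN }}   # hypothetical secret
          aws-region: us-east-1                              # illustrative region
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr
      # ECR repository = service name, image tag = Git SHA of the commit.
      - name: Build and push image
        run: |
          IMAGE="${{ steps.ecr.outputs.registry }}/${{ matrix.service }}:${{ github.sha }}"
          docker build -t "$IMAGE" "apps/${{ matrix.service }}"   # hypothetical Dockerfile path
          docker push "$IMAGE"
      # Hand off to workflow #2 in pivot-internal with the same values.
      - name: Trigger update-docker-image-tag.yml
        env:
          GH_TOKEN: ${{ secrets.PIVOT_INTERNAL_TOKEN }}   # hypothetical PAT
        run: |
          gh workflow run update-docker-image-tag.yml \
            --repo pivot-org/pivot-internal \
            -f service="${{ matrix.service }}" -f sha="${{ github.sha }}"
```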

Workflow #2: update-docker-image-tag.yml

We still need to update our Terraform config to use the new image tag. This pivot-internal workflow takes in a service name and a Git SHA (technically the Git SHA could be any string, as it is just used as an image tag) and uses those values to find the relevant file at the expected path libs/terraform/services/servicename/main.tf. The workflow simply updates the image tag by editing the .tf file and committing to the main branch.
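
A sketch of the shape, assuming the tag appears in the file as image_tag = "<sha>"; the sed pattern and bot identity are hypothetical:

```yaml
# Sketch of update-docker-image-tag.yml in pivot-internal.
name: update-docker-image-tag
on:
  workflow_dispatch:
    inputs:
      service:
        required: true
        type: string
      sha:
        required: true
        type: string
permissions:
  contents: write   # allows the commit back to main
jobs:
  bump-tag:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes the tag is stored as image_tag = "..." in the file.
      - name: Update the image tag in the service's main.tf
        run: |
          sed -i 's/image_tag = ".*"/image_tag = "${{ inputs.sha }}"/' \
            "libs/terraform/services/${{ inputs.service }}/main.tf"
      # Committing to main is what triggers workflow #3 for staging.
      - name: Commit and push
        run: |
          git config user.name "pivot-ci"            # hypothetical bot identity
          git config user.email "ci@pivot.example"
          git commit -am "chore(${{ inputs.service }}): image ${{ inputs.sha }}"
          git push origin main
```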

Workflow #3: deploy-terraform-backend.yml

We need a way to deploy to staging, run E2E tests against staging, and then deploy to production. Terraform Cloud does not provide such a pipelining mechanism. Therefore, we 'manually' run terraform plan and terraform apply for each Terraform Cloud workspace inside GitHub Actions with the Terraform CLI.

This workflow is triggered by any commit to main that modifies .tf files related to the AWS backend, including but not limited to the commits made automatically by the prior workflow.

After deploying to staging and running tests against staging, it runs plan on the multi-tenant production environment. Then, it loops through the array of single-tenant workspace names stored in apps/terraform-aws-single-backend/workspaces-array.json and runs plan against each corresponding Terraform Cloud workspace. (If the only change is an image tag bump, apply is also run automatically.)
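
A condensed sketch of the staging deploy and single-tenant fan-out; path filters, directories, the token secret, and job names are illustrative, and the E2E tests and multi-tenant plan are omitted:

```yaml
# Sketch of deploy-terraform-backend.yml.
name: deploy-terraform-backend
on:
  push:
    branches: [main]
    paths:
      - "libs/terraform/**"          # illustrative path filters
      - "apps/terraform-aws-*/**"
jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          cli_config_credentials_token: ${{ secrets.TFC_API_TOKEN }}  # hypothetical secret
      - run: |
          terraform -chdir=apps/terraform-aws-backend init   # hypothetical dir
          terraform -chdir=apps/terraform-aws-backend apply -auto-approve
  read-workspaces:
    needs: deploy-staging   # E2E tests against staging sit between these jobs
    runs-on: ubuntu-latest
    outputs:
      workspaces: ${{ steps.list.outputs.workspaces }}
    steps:
      - uses: actions/checkout@v4
      - id: list
        run: echo "workspaces=$(cat apps/terraform-aws-single-backend/workspaces-array.json)" >> "$GITHUB_OUTPUT"
  plan-single-tenant:
    needs: read-workspaces
    runs-on: ubuntu-latest
    strategy:
      matrix:
        workspace: ${{ fromJson(needs.read-workspaces.outputs.workspaces) }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          cli_config_credentials_token: ${{ secrets.TFC_API_TOKEN }}
      - env:
          TF_WORKSPACE: ${{ matrix.workspace }}   # selects the TFC workspace
        run: |
          terraform -chdir=apps/terraform-aws-single-backend init
          terraform -chdir=apps/terraform-aws-single-backend plan
          # apply also runs here automatically for image-tag-only changes
```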

Services Managed Manually

  • Axiom (all environments write to one Axiom account)
  • Terraform Cloud (each workspace is created manually)
  • AWS IAM Identity Center (connection to JumpCloud and sub-account permissions per JumpCloud user group)
  • Mux (all production environments write to one Mux account; Mux doesn't provide regional storage, so there is no need for per-tenant Mux accounts)
  • LiveKit Cloud (each production environment uses its own LiveKit Cloud project, with its own API key)
  • Sentry (all environments write to one Sentry account; there is no long-term storage of customer data in Sentry)
  • Rootly
  • Expo (single Expo account for our mobile app; no long-term storage of customer data)
  • Google Cloud (Sign in with Google API keys)
  • PostHog (single PostHog account)
  • OpenAI (single API account; no long-term storage of customer data)
  • AssemblyAI (single API account; no long-term storage of customer data)
  • Various SaaS apps that aren't production infrastructure themselves, like Google Workspace and JumpCloud

Cloudflare Pages for Frontend Sites and PivotAdmin

We use Cloudflare Pages to host our websites and static frontend assets. These are deployed to a single production environment, as even single-tenant backends use them.

Cloudflare Pages is configured via Terraform, with its own Terraform Cloud workspace. Each application is deployed to Cloudflare Pages via its GitHub integration, including PR environments, for both the pivot and pivot-internal repos (a sketch follows the list below).

  1. Docs: Cloudflare Pages deploys the docs site for each PR and on push to main.

  2. Marketing: Cloudflare Pages deploys the marketing site for each PR and on push to main.

  3. Engbook: Cloudflare Pages deploys the EngBook for each PR and on push to main.

  4. Web App: Cloudflare Pages deploys the frontend Pivot app (Expo web export) for each PR and on push to main.

  5. PivotAdmin: Cloudflare Pages deploys the PivotAdmin internal tool for each PR and on push to main.

  6. Storybook: Cloudflare Pages deploys Storybook for each PR and on push to main.
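
As an illustration, one of these projects in Terraform might look like the following, assuming the provider's cloudflare_pages_project resource; the project name, org, and build settings are hypothetical:

```hcl
# Sketch: one Pages project with GitHub-driven production and PR
# preview deployments. All values below are illustrative.
resource "cloudflare_pages_project" "docs" {
  account_id        = var.cloudflare_account_id
  name              = "pivot-docs"        # hypothetical project name
  production_branch = "main"

  source {
    type = "github"
    config {
      owner               = "pivot-org"   # hypothetical org
      repo_name           = "pivot"
      production_branch   = "main"
      pr_comments_enabled = true          # PR preview environments
    }
  }

  build_config {
    build_command   = "npm run build"     # hypothetical
    destination_dir = "dist"
  }
}
```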

Supabase for PivotAdmin

Our Supabase project uses Supabase's GitHub integration with the pivot-internal repository for branching and migrations.