Release Management: How We Ship Code from Commit to Production

The Goal: Fully Auditable, Automated Deployments

When we started building mypie’s infrastructure, we had one hard requirement: every change to production, whether application code or infrastructure, must be traceable back to a Git commit. No cowboy kubectl apply from a developer’s laptop. No terraform apply run by hand or from CI without review.

This post explains the three-layer release system we landed on:

GitHub Actions — build, test, tag, push container images
ArgoCD — deploy application manifests to Kubernetes
Atlantis — plan and apply Terraform on PR review

Layer 1: Application CI — GitHub Actions + ECR

Every service (api-gateway, pie-service, ai-service, etc.) lives in the mypie-platform monorepo. Each has a Dockerfile and a corresponding Helm chart under charts/<service>/.

Staging Flow (every merge to `main`)

Push to main
  ↓
GitHub Actions: .github/workflows/ci.yml
  ↓
Build Docker image
  ↓
Tag: sha-<7char> (e.g. sha-a3f891c)
  ↓
Push to ECR: 589528730153.dkr.ecr.eu-central-1.amazonaws.com/mypie/<service>:sha-a3f891c
  ↓
Commit image tag to: helm/staging/values.yaml [skip ci]
  ↓
ArgoCD detects drift → auto-sync → rolling update

The key is the GitOps commit. After pushing the image, CI mutates the staging values.yaml:

# helm/staging/values.yaml (before)
apiGateway:
  image:
    tag: sha-b2d77e1

# helm/staging/values.yaml (after CI commit)
apiGateway:
  image:
    tag: sha-a3f891c

CI makes this commit using the GITOPS_TOKEN GitHub secret, a PAT with repo write scope. The [skip ci] marker in the commit message prevents an infinite loop.
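The tag-bump step can be sketched in shell. This is a minimal sketch, not the actual step from ci.yml: the function names image_tag_from_sha and set_image_tag are hypothetical, and the awk rewrite assumes the exact two-space layout of the values.yaml excerpt above (a top-level service key, then image:, then tag:).

```shell
# Derive the image tag from a full commit SHA: "sha-" plus the first 7 characters.
image_tag_from_sha() {
  printf 'sha-%.7s\n' "$1"
}

# Rewrite the "tag:" line that belongs to one top-level service key, in place.
# Assumption: the file uses the layout shown above and nothing more exotic.
set_image_tag() {
  local file="$1" service="$2" tag="$3"
  awk -v svc="$service" -v tag="$tag" '
    # Any non-indented, non-comment line opens a new top-level block.
    /^[^[:space:]#]/ { in_svc = ($0 == (svc ":")) }
    # Only replace the tag while inside the requested service block.
    in_svc && /^    tag:/ { print "    tag: " tag; next }
    { print }
  ' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Possible use inside the workflow (GITHUB_SHA is set by GitHub Actions):
#   set_image_tag helm/staging/values.yaml apiGateway "$(image_tag_from_sha "$GITHUB_SHA")"
#   git commit -am "chore(staging): bump apiGateway image [skip ci]"
```

A YAML-aware tool would be more robust than awk if the values file grows beyond this simple shape; the text-based rewrite is used here only to keep the sketch dependency-free.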

ArgoCD polls the repo every 3 minutes (or receives a webhook). When it sees that the staging values.yaml has changed, it syncs. Staging sync is automated, so no human is needed.

Production Flow (tagged releases)

Production follows the same build step, but the trigger is a v* tag:

git tag v1.4.2
git push origin v1.4.2
  ↓
GitHub Actions: build & push image tagged v1.4.2
  ↓
Commit tag to: helm/production/values.yaml [skip ci]
  ↓
ArgoCD shows production app as OutOfSync
  ↓
Engineer reviews diff in ArgoCD UI
  ↓
Manual sync → rolling update
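The difference between the two flows comes down to which Git ref triggered the workflow. The sketch below shows one way that routing could look; values_file_for_ref and image_tag_for_ref are hypothetical helper names, and the ref strings follow the refs/heads/... and refs/tags/... format that GitHub Actions exposes in GITHUB_REF.

```shell
# Pick the values file CI should update for a given Git ref.
values_file_for_ref() {
  case "$1" in
    refs/tags/v*)    echo "helm/production/values.yaml" ;;
    refs/heads/main) echo "helm/staging/values.yaml" ;;
    *)               return 1 ;;  # any other ref deploys nowhere
  esac
}

# Pick the image tag: the release tag itself for v* tags, sha-<7char> for main.
image_tag_for_ref() {
  case "$1" in
    refs/tags/v*)    printf '%s\n' "${1#refs/tags/}" ;;
    refs/heads/main) printf 'sha-%.7s\n' "$2" ;;
    *)               return 1 ;;
  esac
}

# Possible use inside the workflow:
#   file="$(values_file_for_ref "$GITHUB_REF")"
#   tag="$(image_tag_for_ref "$GITHUB_REF" "$GITHUB_SHA")"
```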

Production sync is manual by design. We want a human to review what’s changing before it goes live. This gives us a natural release gate without a separate approval system.

# From the ArgoCD UI, or via CLI:
argocd app sync mypie-production --prune

Rollback

Because every deploy is a git commit, rollback is just reverting the values.yaml commit:

# Find the previous tag
git log -- helm/production/values.yaml

# Revert the values.yaml to the previous tag
git revert <commit-sha>
git push origin main
# ArgoCD shows OutOfSync again → manual sync
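The lookup and revert can be wrapped in a small helper so nobody has to copy a SHA by hand. This is a sketch under assumptions: last_commit_for_path and rollback_env are hypothetical names, the helper reverts only the most recent commit that touched the environment's values.yaml, and it assumes a clean working tree.

```shell
# Print the SHA of the most recent commit that touched a given path.
last_commit_for_path() {
  git log -1 --format=%H -- "$1"
}

# Revert the latest tag bump for one environment (staging or production).
# --no-edit keeps the generated "Revert ..." message so no editor opens.
rollback_env() {
  local file="helm/$1/values.yaml" sha
  sha="$(last_commit_for_path "$file")" || return 1
  [ -n "$sha" ] || { echo "no commits found for $file" >&2; return 1; }
  git revert --no-edit "$sha"
}

# Possible use:
#   rollback_env production && git push origin main
```

If the latest commit on that file is not the one to undo, the manual git log and git revert <commit-sha> steps above remain the safer route.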

Layer 2: Kubernetes GitOps — ArgoCD App-of-Apps

ArgoCD runs on the staging cluster (and will be promoted to prod). We use the App-of-Apps pattern: a single root Application points at k8s/argocd/apps/, and each YAML file in that directory is itself an Application or ApplicationSet.

root-app (App of Apps)
  └── k8s/argocd/apps/
        ├── mypie-staging.yaml       → Helm chart for mypie services (staging)
        ├── mypie-production.yaml    → Helm chart for mypie services (prod)
        └── infra-appset.yaml        → ApplicationSet: LBC, cert-manager, metrics-server, Atlantis

Bootstrap (one-time)

# 1. Install ArgoCD via Helm
helm repo add argo https://argoproj.github.io/argo-helm
helm upgrade --install argocd argo/argo-cd \
  -f k8s/argocd/install/values.yaml \
  --namespace argocd --create-namespace \
  --version "7.x.x"

# 2. Deploy the root app — after this, ArgoCD manages itself
kubectl apply -f k8s/argocd/apps/root-app.yaml

Once root-app.yaml is applied, ArgoCD takes over. Any new file added to k8s/argocd/apps/ is automatically picked up on the next sync.

Sync Waves

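Infrastructure add-ons have ordering dependencies (cert-manager before Atlantis, the AWS Load Balancer Controller (LBC) before any Ingress). We express these with ArgoCD sync waves, which are set through the argocd.argoproj.io/sync-wave annotation: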

# infra-appset.yaml
- name: aws-load-balancer-controller
  wave: "1"    # deploys first

- name: cert-manager
  wave: "2"    # after LBC

- name: atlantis
  wave: "20"   # last — needs Ingress to already work

ArgoCD processes wave 1 resources, waits for them to become healthy, then moves to wave 2, and so on.
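Waves are compared as integers, lowest first, so "20" runs after "2". The sketch below models only that ordering and is not how ArgoCD is implemented; deploy_order is a hypothetical helper that reads "name wave" pairs and prints them in the order they would be applied.

```shell
# Read "name wave" lines on stdin and print "wave: name" in apply order.
# sort -k2,2n sorts numerically on the wave; -k1,1 breaks ties by name.
deploy_order() {
  sort -k2,2n -k1,1 | awk '{ print $2 ": " $1 }'
}

# Possible use:
#   printf 'atlantis 20\ncert-manager 2\naws-load-balancer-controller 1\n' | deploy_order
```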

Layer 3: Terraform GitOps — Atlantis

ArgoCD covers everything that runs on Kubernetes, but what about infrastructure changes? Terraform runs through Atlantis, which lives as a pod in the staging cluster.

How Atlantis Works

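1. An engineer opens a PR in mypie-infra that changes Terraform.
2. Atlantis detects the PR via a GitHub webhook.
3. Atlantis runs terraform plan and posts the plan output as a comment on the PR.
4. A reviewer checks the plan and approves; the apply is then triggered with the PR comment atlantis apply.
5. Atlantis runs terraform apply and posts the result on the PR. Its lock on the project is released once the PR is merged or closed.
6. The engineer merges the PR.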

PR created
  ↓
Atlantis webhook fires
  ↓
terraform plan → comment on PR
  ↓
Code review + "atlantis apply" comment
  ↓
terraform apply → updates AWS resources
  ↓
PR merged

The Atlantis pod uses an IRSA role (mypie-eks-staging-atlantis) with the AdministratorAccess policy attached. The role’s trust policy is tied to the EKS cluster’s OIDC provider and the Atlantis service account, so only that service account can assume it. No AWS keys are stored anywhere.

Atlantis Config

# atlantis.yaml (in mypie-infra root)
version: 3
automerge: false
projects:
  - name: staging
    dir: terraform/environments/staging
    workspace: default
    autoplan:
      enabled: true
      when_modified: ["**/*.tf", "../../modules/**/*.tf"]
  - name: prod
    dir: terraform/environments/prod
    workspace: default
    autoplan:
      enabled: true
      when_modified: ["**/*.tf", "../../modules/**/*.tf"]
  - name: dns
    dir: terraform/environments/dns
    workspace: default
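The when_modified patterns are resolved relative to each project's dir, which is why ../../modules/**/*.tf points at the shared terraform/modules tree. The sketch below approximates that resolution for the staging and prod projects; triggers_autoplan is a hypothetical helper, and shell case globbing is only an approximation of the pattern matching Atlantis actually performs.

```shell
# Return 0 if a changed file (repo-relative path) would match
# when_modified: ["**/*.tf", "../../modules/**/*.tf"] for a project dir.
triggers_autoplan() {
  local project_dir="$1" changed="$2" modules_dir
  # "../../modules" from terraform/environments/<env> resolves to terraform/modules.
  modules_dir="$(dirname "$(dirname "$project_dir")")/modules"
  case "$changed" in
    "$project_dir"/*.tf) return 0 ;;  # **/*.tf at any depth below the project dir
    "$modules_dir"/*.tf) return 0 ;;  # shared modules used by the environment
    *)                   return 1 ;;
  esac
}

# Possible use:
#   triggers_autoplan terraform/environments/staging terraform/modules/vpc/main.tf && echo "plan staging"
```

The practical consequence is that a change to a shared module plans both staging and prod in the same PR, while a change under one environment plans only that environment.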

Secrets

Atlantis needs credentials to call the GitHub API and run Terraform. These live in a Kubernetes secret (never in Git):

kubectl create secret generic atlantis-secrets \
  --namespace atlantis \
  --from-literal=ATLANTIS_GH_TOKEN=<github-token> \
  --from-literal=ATLANTIS_GH_WEBHOOK_SECRET=<webhook-secret> \
  --from-literal=TF_VAR_cloudflare_api_token=<cf-token>

The Atlantis Helm values reference this secret via envFrom.

The Full Picture

Developer pushes code
       │
       ├─ app change ──→ GA builds image → ECR → values.yaml commit
       │                       │
       │                       ├─ staging: ArgoCD auto-syncs
       │                       └─ prod: ArgoCD shows OutOfSync → manual sync
       │
       └─ infra change ──→ PR opened → Atlantis plans → human approves → apply

Every change is a Git commit. Every deploy is reviewable. Every rollback is a revert. Staging picks up a Git change within about 3 minutes (one ArgoCD poll interval), and production stays OutOfSync only until an engineer reviews and syncs it.

Lessons

Lesson 1: The GitOps commit needs its own token. Commits pushed with GITHUB_TOKEN (the default CI token) do not trigger subsequent workflow runs, which caused issues for us when committing back to the same repo that triggered the workflow. We created a dedicated GITOPS_TOKEN PAT with repo write scope for this purpose.

Lesson 2: [skip ci] is mandatory on the values.yaml commit. Without it, the GitOps commit triggers another CI run, which pushes a new image, which commits a new tag, which triggers another run — infinite loop.

Lesson 3: Production sync should be manual. A 2am auto-sync of production is a bad idea. Manual sync gives you a chance to read the ArgoCD diff and say “wait, why is that pod count changing?” before it’s live.

Lesson 4: Atlantis ≠ terraform apply in CI. Running Terraform in CI works, but you lose the PR comment UX and the ability to re-run atlantis plan from a comment with different variables. Atlantis gives you a proper review workflow for infrastructure changes, not just code changes.
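Lesson 2 depends on GitHub Actions skipping push-triggered runs when the head commit message contains a skip marker. A small check like the one below could guard the GitOps commit message in a test or hook. It is a sketch: should_skip_ci is a hypothetical name, and the marker list reflects the skip keywords GitHub documents as far as we know, so it is worth confirming against the current docs.

```shell
# Return 0 if a commit message contains one of the skip markers.
should_skip_ci() {
  case "$1" in
    *"[skip ci]"*|*"[ci skip]"*|*"[no ci]"*|*"[skip actions]"*|*"[actions skip]"*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Possible use before pushing the values.yaml commit:
#   should_skip_ci "$(git log -1 --format=%B)" || { echo "missing [skip ci]" >&2; exit 1; }
```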