AWS Load Balancer Controller: Four Failures Before It Worked

The AWS Load Balancer Controller (LBC) is responsible for provisioning ALBs and NLBs in response to Kubernetes Ingress resources. Getting it running looked simple on paper — install via Helm, point at the cluster — but we hit four distinct failures before the first ALB was created.

Our Setup

We deploy LBC via an ArgoCD ApplicationSet defined in k8s/argocd/apps/infra-appset.yaml. The relevant section:

- name: aws-load-balancer-controller
  namespace: kube-system
  chart: aws-load-balancer-controller
  repoURL: https://aws.github.io/eks-charts
  targetRevision: "1.11.*"
  wave: "1"
  valuesInline: |
    clusterName: mypie-eks-staging
    serviceAccount:
      create: true
      name: aws-load-balancer-controller
      annotations:
        eks.amazonaws.com/role-arn: "REPLACE_WITH_LBC_IRSA_ARN"

Failure 1: No IRSA Role Existed

The first sign of trouble was the pod getting stuck in CrashLoopBackOff. Logs showed:

{"level":"error","ts":"...","msg":"failed to get VPC ID from instance metadata",
"error":"EC2 IMDS is not available, please check the IMDS service"}

Before that could even be addressed, the pod was failing an AWS auth check. We inspected the service account:

kubectl get sa aws-load-balancer-controller -n kube-system -o yaml

The service account existed with the REPLACE_WITH_LBC_IRSA_ARN annotation — which was the literal placeholder string from our template — and it had not been replaced with an actual IAM role ARN.

Root cause: The IRSA role for LBC did not exist yet. We had planned to create it via Terraform but it had been skipped.

Fix: Created the IAM role manually via AWS CLI using the official LBC IAM policy. The role trust relationship uses the OIDC provider of the EKS cluster:

# Get the OIDC provider URL
OIDC_URL=$(aws eks describe-cluster \
  --name mypie-eks-staging \
  --query 'cluster.identity.oidc.issuer' \
  --output text | sed 's|https://||')

ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

# Create trust policy
cat > /tmp/lbc-trust.json << TRUSTEOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_URL}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${OIDC_URL}:aud": "sts.amazonaws.com",
        "${OIDC_URL}:sub": "system:serviceaccount:kube-system:aws-load-balancer-controller"
      }
    }
  }]
}
TRUSTEOF

aws iam create-role \
  --role-name mypie-eks-staging-aws-load-balancer-controller \
  --assume-role-policy-document file:///tmp/lbc-trust.json

# Download and attach the official policy
curl -o /tmp/lbc-policy.json \
  https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.11.0/docs/install/iam_policy.json

aws iam put-role-policy \
  --role-name mypie-eks-staging-aws-load-balancer-controller \
  --policy-name AWSLoadBalancerControllerPolicy \
  --policy-document file:///tmp/lbc-policy.json

Failure 2: Placeholder Annotation Left in Values

After creating the role, we had the ARN. But we still had REPLACE_WITH_LBC_IRSA_ARN in our infra-appset.yaml. ArgoCD was already managing the deployment and would override any manual changes.

We did two things in parallel:

Updated infra-appset.yaml with the real ARN (line 29):

annotations:
  eks.amazonaws.com/role-arn: "arn:aws:iam::<account-id>:role/mypie-eks-staging-aws-load-balancer-controller"

Immediately patched the live service account so we could test without waiting for ArgoCD sync:

kubectl annotate serviceaccount aws-load-balancer-controller \
  -n kube-system \
  eks.amazonaws.com/role-arn=arn:aws:iam::<account-id>:role/mypie-eks-staging-aws-load-balancer-controller \
  --overwrite

Restarted the pod:

kubectl rollout restart deployment aws-load-balancer-controller -n kube-system

Failure 3: IMDS Unreachable from Pods

After fixing the role and annotation, the pod came back up but was still crashing:

{"level":"error","msg":"failed to get VPC ID",
"error":"couldn't retrieve VPC ID from instance metadata service,
EC2 IMDS is not available"}

LBC auto-detects the VPC ID by calling the EC2 Instance Metadata Service (IMDS) — specifically http://169.254.169.254/latest/meta-data/local-ipv4 to get the node IP, then making an EC2 API call to look up the VPC. This works when running on EC2 directly. Inside a Kubernetes pod with IMDSv2’s hop limit set to 1, the pod cannot reach IMDS (the extra network hop from the container to the node exceeds the hop limit).

Fix: Pass the VPC ID explicitly so LBC does not need to use IMDS to discover it. This is set via the vpcId Helm value.

Updated infra-appset.yaml (line 24):

valuesInline: |
  clusterName: mypie-eks-staging
  vpcId: vpc-<your-vpc-id>
  serviceAccount:
    create: true
    name: aws-load-balancer-controller
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<account-id>:role/mypie-eks-staging-aws-load-balancer-controller"

And patched the live deployment directly for immediate effect:

kubectl patch deployment aws-load-balancer-controller -n kube-system \
  --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--aws-vpc-id=vpc-<your-vpc-id>"}]'

The pods came up healthy. Then we hit the final problem.

Failure 4: Subnet Tags with a Transposed Cluster Name

LBC was running but when we created an Ingress resource, no ALB was provisioned. The logs showed:

{
  "level": "error",
  "msg": "couldn't auto-discover subnets: unable to resolve at least one subnet",
  "error": "2 match VPC and tags: [kubernetes.io/role/elb], 2 tagged for other cluster"
}

LBC uses two criteria to auto-discover public subnets for ALBs:

Tag kubernetes.io/role/elb = 1 (or internal-elb for private)
Tag kubernetes.io/cluster/<cluster-name> = owned or shared

We had correctly tagged the subnets for criterion 1. But criterion 2 was wrong. Our subnets were tagged:

kubernetes.io/cluster/mypie-staging-eks = owned

The actual cluster name is mypie-eks-staging — eks comes after mypie, not after staging. The two words were transposed.

Fix: Re-tag all 4 subnets (2 public for ALB, 2 private for internal):

SUBNETS=(
  subnet-<public-az-a>
  subnet-<public-az-b>
  subnet-<private-az-a>
  subnet-<private-az-b>
)

for subnet in "${SUBNETS[@]}"; do
  # Remove the wrong tag
  aws ec2 delete-tags \
    --resources "$subnet" \
    --tags Key="kubernetes.io/cluster/mypie-staging-eks"

  # Add the correct tag
  aws ec2 create-tags \
    --resources "$subnet" \
    --tags Key="kubernetes.io/cluster/mypie-eks-staging",Value="owned"

  echo "Fixed $subnet"
done

After re-tagging, LBC reconciled the pending Ingress resources and provisioned the ALB within about 30 seconds:

{"level":"info","msg":"created loadBalancer",
"stackID":"mypie-tools",
"resourceID":"LoadBalancer",
"arn":"arn:aws:elasticloadbalancing:eu-central-1:<account-id>:loadbalancer/app/k8s-mypietools-.../..."}

Commit the Fix to Terraform

After all four manual fixes were validated, we committed the changes to infrastructure-as-code. The subnet tags are managed in the networking module, and the IRSA role was added to the EKS module’s IRSA configuration. The vpcId and correct role-arn are now in infra-appset.yaml:

valuesInline: |
  clusterName: mypie-eks-staging
  vpcId: vpc-<your-vpc-id>
  serviceAccount:
    create: true
    name: aws-load-balancer-controller
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<account-id>:role/mypie-eks-staging-aws-load-balancer-controller"
  replicaCount: 2

Summary of All Four Failures

#	Error	Root Cause	Fix
1	IMDS not available / auth error	IRSA role did not exist	Create IAM role with correct OIDC trust
2	Pod failed to auth to AWS	Placeholder `REPLACE_WITH_LBC_IRSA_ARN` in annotation	Update `infra-appset.yaml` line 29 + `kubectl annotate`
3	Failed to get VPC ID from IMDS	IMDSv2 hop limit blocks pod → IMDS access	Add `vpcId` to Helm values (`infra-appset.yaml` line 24)
4	Couldn’t auto-discover subnets	Subnet tag had transposed cluster name	Re-tag subnets to `kubernetes.io/cluster/mypie-eks-staging`