AWS Load Balancer Controller: Four Failures Before It Worked

Getting LBC running on EKS required solving four separate problems in sequence: a missing IRSA role, a template placeholder left in values, IMDS unreachable from pods, and subnet tags with a transposed cluster name.

The AWS Load Balancer Controller (LBC) is responsible for provisioning ALBs and NLBs in response to Kubernetes Ingress resources. Getting it running looked simple on paper — install via Helm, point at the cluster — but we hit four distinct failures before the first ALB was created.

Our Setup

We deploy LBC via an ArgoCD ApplicationSet defined in k8s/argocd/apps/infra-appset.yaml. The relevant section:

- name: aws-load-balancer-controller
  namespace: kube-system
  chart: aws-load-balancer-controller
  repoURL: https://aws.github.io/eks-charts
  targetRevision: "1.11.*"
  wave: "1"
  valuesInline: |
    clusterName: mypie-eks-staging
    serviceAccount:
      create: true
      name: aws-load-balancer-controller
      annotations:
        eks.amazonaws.com/role-arn: "REPLACE_WITH_LBC_IRSA_ARN"

Failure 1: No IRSA Role Existed

The first sign of trouble was the pod getting stuck in CrashLoopBackOff. Logs showed:

{"level":"error","ts":"...","msg":"failed to get VPC ID from instance metadata",
"error":"EC2 IMDS is not available, please check the IMDS service"}

The IMDS part of that error is covered under Failure 3. Before it could be addressed, the pod was also failing AWS authentication, so we inspected the service account first:

kubectl get sa aws-load-balancer-controller -n kube-system -o yaml

The service account existed, but its eks.amazonaws.com/role-arn annotation still held REPLACE_WITH_LBC_IRSA_ARN, the literal placeholder string from our template. It had never been replaced with an actual IAM role ARN.
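A check along these lines would have flagged the placeholder before the pod ever started. This is a minimal sketch: is_role_arn is a hypothetical helper of ours, and its pattern is a loose sanity check (partition, 12-digit account ID, role path), not a full ARN validator.

```shell
# Returns 0 if the argument looks like an IAM role ARN, 1 otherwise.
# Loose sanity check only: it does not prove the role exists.
is_role_arn() {
  printf '%s\n' "$1" | grep -Eq '^arn:aws[a-z-]*:iam::[0-9]{12}:role/.+$'
}

# Possible usage against a live cluster (needs kubectl access):
#   arn=$(kubectl get sa aws-load-balancer-controller -n kube-system \
#     -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')
#   is_role_arn "$arn" || echo "role-arn annotation is not a role ARN: $arn" >&2
```

Run as a pre-sync step or in CI against the rendered values, the same function rejects both the literal placeholder and an ARN whose account ID was never filled in.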

Root cause: The IRSA role for LBC did not exist yet. We had planned to create it via Terraform but it had been skipped.

Fix: Created the IAM role manually via AWS CLI using the official LBC IAM policy. The role trust relationship uses the OIDC provider of the EKS cluster:

# Get the OIDC provider URL
OIDC_URL=$(aws eks describe-cluster \
  --name mypie-eks-staging \
  --query 'cluster.identity.oidc.issuer' \
  --output text | sed 's|https://||')

ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

# Create trust policy
cat > /tmp/lbc-trust.json << TRUSTEOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_URL}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${OIDC_URL}:aud": "sts.amazonaws.com",
        "${OIDC_URL}:sub": "system:serviceaccount:kube-system:aws-load-balancer-controller"
      }
    }
  }]
}
TRUSTEOF

aws iam create-role \
  --role-name mypie-eks-staging-aws-load-balancer-controller \
  --assume-role-policy-document file:///tmp/lbc-trust.json

# Download the official policy and embed it in the role as an inline policy
curl -fsSL -o /tmp/lbc-policy.json \
  https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.11.0/docs/install/iam_policy.json

aws iam put-role-policy \
  --role-name mypie-eks-staging-aws-load-balancer-controller \
  --policy-name AWSLoadBalancerControllerPolicy \
  --policy-document file:///tmp/lbc-policy.json
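The trust policy only works if three strings line up exactly: the issuer without its https:// prefix, the OIDC provider ARN, and the service account subject. The sketch below rebuilds each one so they can be compared against what IAM actually stored. The function names are ours, and the issuer ID in the usage comment is a made-up placeholder.

```shell
# Strip the scheme from an OIDC issuer URL (same effect as the sed call above).
oidc_host_path() {
  printf '%s\n' "${1#https://}"
}

# Build the IAM OIDC provider ARN used as the Federated principal.
# args: account-id issuer-url
oidc_provider_arn() {
  printf 'arn:aws:iam::%s:oidc-provider/%s\n' "$1" "$(oidc_host_path "$2")"
}

# Build the "sub" claim that the trust policy must match for a service account.
# args: namespace service-account-name
irsa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Possible usage (placeholder issuer ID, needs AWS credentials):
#   oidc_provider_arn "$ACCOUNT_ID" "https://oidc.eks.eu-central-1.amazonaws.com/id/EXAMPLE"
#   irsa_subject kube-system aws-load-balancer-controller
#   aws iam get-role \
#     --role-name mypie-eks-staging-aws-load-balancer-controller \
#     --query 'Role.AssumeRolePolicyDocument'
```

If the namespace or service account name in the stored policy differs from the irsa_subject output by even one character, sts:AssumeRoleWithWebIdentity is denied. The provider ARN must also correspond to an IAM OIDC provider that is registered in the account; aws iam list-open-id-connect-providers shows which ones exist.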

Failure 2: Placeholder Annotation Left in Values

After creating the role, we had the ARN. But we still had REPLACE_WITH_LBC_IRSA_ARN in our infra-appset.yaml. ArgoCD was already managing the deployment and would override any manual changes.

We did two things in parallel:

  1. Updated infra-appset.yaml with the real ARN (line 29):
annotations:
  eks.amazonaws.com/role-arn: "arn:aws:iam::<account-id>:role/mypie-eks-staging-aws-load-balancer-controller"
  2. Immediately patched the live service account so we could test without waiting for ArgoCD sync:
kubectl annotate serviceaccount aws-load-balancer-controller \
  -n kube-system \
  eks.amazonaws.com/role-arn=arn:aws:iam::<account-id>:role/mypie-eks-staging-aws-load-balancer-controller \
  --overwrite

Then we restarted the deployment so that new pods would be created with the updated annotation:

kubectl rollout restart deployment aws-load-balancer-controller -n kube-system
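The restart matters because the IRSA credentials are injected when a pod is created, not when the service account changes. One way to confirm the new pod picked them up is to look for the two environment variables the EKS pod identity webhook adds. In this sketch has_irsa_env is a hypothetical helper, and the label selector in the usage comment is an assumption about how the chart labels its pods.

```shell
# Returns 0 if a whitespace-separated list of env var names contains both
# variables injected for IRSA, 1 otherwise.
has_irsa_env() {
  case " $* " in *" AWS_ROLE_ARN "*) ;; *) return 1 ;; esac
  case " $* " in *" AWS_WEB_IDENTITY_TOKEN_FILE "*) ;; *) return 1 ;; esac
}

# Possible usage (label selector is an assumption, adjust to your release):
#   names=$(kubectl get pods -n kube-system \
#     -l app.kubernetes.io/name=aws-load-balancer-controller \
#     -o jsonpath='{.items[0].spec.containers[0].env[*].name}')
#   has_irsa_env $names || echo "IRSA env vars missing; was the pod recreated?" >&2
```

Reading the pod spec rather than running env inside the container avoids depending on a shell being present in the controller image.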

Failure 3: IMDS Unreachable from Pods

After fixing the role and annotation, the pod came back up but was still crashing:

{"level":"error","msg":"failed to get VPC ID",
"error":"couldn't retrieve VPC ID from instance metadata service,
EC2 IMDS is not available"}

LBC auto-detects the VPC ID by querying the EC2 Instance Metadata Service (IMDS) at http://169.254.169.254, where the node's network interface metadata includes the VPC ID. This works for processes running directly on the EC2 instance. From a pod on the pod network, with the instance's IMDSv2 hop limit set to 1, the lookup fails: the extra network hop between the container and the node exceeds the hop limit, so the IMDSv2 token response never reaches the pod.
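To confirm that the hop limit is the cause, the node's metadata options can be read with the AWS CLI and compared against what a pod needs. A minimal sketch follows. It assumes the pod does not use hostNetwork, so its traffic takes one extra hop through the node; pod_can_reach_imds is a hypothetical helper.

```shell
# Returns 0 if a pod on the pod network can be expected to reach IMDSv2
# with the given HttpPutResponseHopLimit, non-zero otherwise.
# Assumption: one extra hop between container and node, so the limit must be >= 2.
pod_can_reach_imds() {
  [ "$1" -ge 2 ]
}

# Possible usage (needs AWS credentials and a real instance ID):
#   limit=$(aws ec2 describe-instances --instance-ids <instance-id> \
#     --query 'Reservations[].Instances[].MetadataOptions.HttpPutResponseHopLimit' \
#     --output text)
#   pod_can_reach_imds "$limit" || echo "hop limit $limit blocks pods from IMDS" >&2
```

Raising the limit to 2 with aws ec2 modify-instance-metadata-options is the other way out, but it opens IMDS to every pod on the node, which is one reason to prefer passing the VPC ID explicitly.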

Fix: Pass the VPC ID explicitly so LBC does not need to use IMDS to discover it. This is set via the vpcId Helm value.

Updated infra-appset.yaml (line 24):

valuesInline: |
  clusterName: mypie-eks-staging
  vpcId: vpc-<your-vpc-id>
  serviceAccount:
    create: true
    name: aws-load-balancer-controller
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<account-id>:role/mypie-eks-staging-aws-load-balancer-controller"

And patched the live deployment directly for immediate effect:

kubectl patch deployment aws-load-balancer-controller -n kube-system \
  --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--aws-vpc-id=vpc-<your-vpc-id>"}]'

The pods came up healthy. Then we hit the final problem.

Failure 4: Subnet Tags with a Transposed Cluster Name

LBC was running, but when we created an Ingress resource no ALB was provisioned. The logs showed:

{
  "level": "error",
  "msg": "couldn't auto-discover subnets: unable to resolve at least one subnet",
  "error": "2 match VPC and tags: [kubernetes.io/role/elb], 2 tagged for other cluster"
}

LBC uses two criteria to auto-discover public subnets for ALBs:

  1. Tag kubernetes.io/role/elb = 1 (or kubernetes.io/role/internal-elb = 1 for private subnets)
  2. Tag kubernetes.io/cluster/<cluster-name> = owned or shared

We had correctly tagged the subnets for criterion 1. But criterion 2 was wrong. Our subnets were tagged:

kubernetes.io/cluster/mypie-staging-eks = owned

The actual cluster name is mypie-eks-staging: eks comes after mypie, not after staging. The two words were transposed.
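The "tagged for other cluster" wording in the error suggests how the filter behaves: a subnet carrying a kubernetes.io/cluster/ tag for some other name is excluded even when its role tag is correct. The sketch below approximates that rule from tag keys alone. It is our reading of the observed behavior, not LBC's actual code, it ignores tag values, and classify_subnet is a hypothetical helper.

```shell
# Classify a subnet for ALB auto-discovery from its tag keys.
# args: cluster-name role-tag-key tag-key...
# Prints "ok", "missing-role-tag", or "other-cluster".
classify_subnet() {
  cluster=$1; role_key=$2; shift 2
  has_role=no; has_own=no; has_other=no
  for key in "$@"; do
    case "$key" in
      "$role_key") has_role=yes ;;
      "kubernetes.io/cluster/$cluster") has_own=yes ;;
      kubernetes.io/cluster/*) has_other=yes ;;
    esac
  done
  if [ "$has_role" = no ]; then
    echo missing-role-tag
  elif [ "$has_own" = no ] && [ "$has_other" = yes ]; then
    echo other-cluster
  else
    echo ok
  fi
}

# Possible usage (needs AWS credentials and a real subnet ID):
#   keys=$(aws ec2 describe-subnets --subnet-ids <subnet-id> \
#     --query 'Subnets[0].Tags[].Key' --output text)
#   classify_subnet mypie-eks-staging kubernetes.io/role/elb $keys
```

With our original tags, the public subnets classify as other-cluster, which matches the "2 tagged for other cluster" count in the log.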

Fix: Re-tag all 4 subnets (2 public for ALB, 2 private for internal):

SUBNETS=(
  subnet-<public-az-a>
  subnet-<public-az-b>
  subnet-<private-az-a>
  subnet-<private-az-b>
)

for subnet in "${SUBNETS[@]}"; do
  # Remove the wrong tag
  aws ec2 delete-tags \
    --resources "$subnet" \
    --tags Key="kubernetes.io/cluster/mypie-staging-eks"

  # Add the correct tag
  aws ec2 create-tags \
    --resources "$subnet" \
    --tags Key="kubernetes.io/cluster/mypie-eks-staging",Value="owned"

  echo "Fixed $subnet"
done

After re-tagging, LBC reconciled the pending Ingress resources and provisioned the ALB within about 30 seconds:

{"level":"info","msg":"created loadBalancer",
"stackID":"mypie-tools",
"resourceID":"LoadBalancer",
"arn":"arn:aws:elasticloadbalancing:eu-central-1:<account-id>:loadbalancer/app/k8s-mypietools-.../..."}

Commit the Fix to Terraform

After all four manual fixes were validated, we committed the changes to infrastructure-as-code. The subnet tags are managed in the networking module, and the IRSA role was added to the EKS module’s IRSA configuration. The vpcId and correct role-arn are now in infra-appset.yaml:

valuesInline: |
  clusterName: mypie-eks-staging
  vpcId: vpc-<your-vpc-id>
  serviceAccount:
    create: true
    name: aws-load-balancer-controller
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<account-id>:role/mypie-eks-staging-aws-load-balancer-controller"
  replicaCount: 2

Summary of All Four Failures

| # | Error | Root Cause | Fix |
|---|-------|------------|-----|
| 1 | IMDS not available / auth error | IRSA role did not exist | Create IAM role with correct OIDC trust |
| 2 | Pod failed to auth to AWS | Placeholder REPLACE_WITH_LBC_IRSA_ARN in annotation | Update infra-appset.yaml line 29 + kubectl annotate |
| 3 | Failed to get VPC ID from IMDS | IMDSv2 hop limit blocks pod → IMDS access | Add vpcId to Helm values (infra-appset.yaml line 24) |
| 4 | Couldn’t auto-discover subnets | Subnet tag had transposed cluster name | Re-tag subnets to kubernetes.io/cluster/mypie-eks-staging |