CRITICAL

AWS Application Load Balancer: Fix 503 Service Unavailable due to Target Group Resource Exhaustion

Quick Fix Summary

TL;DR

Increase target group capacity or scale out healthy targets immediately.

The ALB cannot route traffic because the target group has insufficient resources (e.g., no healthy targets, connection limits exceeded, or insufficient capacity).

Diagnosis & Causes

  • All registered targets are unhealthy or failing health checks.
  • Targets have reached their concurrent connection or request rate limit (for example, exhausted worker threads).
  • Backend instances are at 100% CPU/memory, causing health check failures or timeouts.
Recovery Steps

    Step 1: Verify Target Health and Load Balancer Metrics

    Check the health status of targets and review CloudWatch metrics for the ALB and target group to confirm resource exhaustion.

    bash
    # Describe target health for the specific target group
    aws elbv2 describe-target-health --target-group-arn <TARGET_GROUP_ARN>
    # Check the ALB's HTTPCode_ELB_5XX_Count for the last hour (repeat with --metric-name TargetConnectionErrorCount; date -d requires GNU date)
    aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB --metric-name HTTPCode_ELB_5XX_Count --dimensions Name=LoadBalancer,Value=<ALB_ARN_Suffix> --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) --period 300 --statistics Sum
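When a target group holds many targets, a per-state count is easier to read than the raw health listing. The helper below is a minimal sketch rather than an official tool: `summarize_target_health` is a hypothetical function name, and it assumes the states arrive as whitespace-separated tokens, which is what `--query 'TargetHealthDescriptions[].TargetHealth.State' --output text` should produce.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the AWS CLI): count targets per health state.
# Reads whitespace-separated state names (healthy, unhealthy, draining, ...) on stdin.
summarize_target_health() {
  awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
       END { for (state in count) print state "=" count[state] }' | sort
}

# Usage (requires AWS credentials; <TARGET_GROUP_ARN> is a placeholder):
# aws elbv2 describe-target-health --target-group-arn <TARGET_GROUP_ARN> \
#   --query 'TargetHealthDescriptions[].TargetHealth.State' --output text \
#   | summarize_target_health
```

If no line starts with `healthy=`, the target group has no healthy targets and the 503s are expected.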

    Step 2: Scale Out Healthy Targets

    Increase the number of healthy instances in your Auto Scaling Group (ASG) or manually register new, healthy targets to the group.

    bash
    # Set desired capacity for the ASG linked to the target group (skip the cooldown so the change applies immediately)
    aws autoscaling set-desired-capacity --auto-scaling-group-name <ASG_NAME> --desired-capacity <NEW_CAPACITY> --no-honor-cooldown
    # Manually register a new EC2 instance to the target group
    aws elbv2 register-targets --target-group-arn <TARGET_GROUP_ARN> --targets Id=<INSTANCE_ID>
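Choosing a value for `<NEW_CAPACITY>` is easier with a rough sizing estimate. The sketch below assumes you know your peak request rate and roughly how many requests per second one target can sustain. `required_targets` and all three inputs are hypothetical, so treat the result as a starting point rather than a guarantee.

```bash
#!/usr/bin/env bash
# Hypothetical sizing helper: estimate how many targets a given load needs.
# $1 = peak requests per second across the ALB (assumed, take it from your metrics)
# $2 = requests per second one target can sustain (assumed, take it from load tests)
# $3 = extra headroom in percent (e.g. 25)
required_targets() {
  local peak=$1 per_target=$2 headroom=$3
  # Integer ceiling of peak * (1 + headroom/100) / per_target
  echo $(( (peak * (100 + headroom) + per_target * 100 - 1) / (per_target * 100) ))
}

# Example: 1200 req/s at peak, 100 req/s per target, 25% headroom
required_targets 1200 100 25
```

After scaling, `aws elbv2 wait target-in-service --target-group-arn <TARGET_GROUP_ARN>` blocks until the registered targets pass their health checks.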

    Step 3: Adjust Target Group Health Check Settings

    Temporarily relax the health checks so that slow but functional targets can pass, but first confirm the backend can actually handle traffic. Focus on the timeout, the interval, and the threshold counts.

    bash
    # Modify the health check for the target group (example: longer timeout, fewer consecutive successes required to be marked healthy)
    aws elbv2 modify-target-group --target-group-arn <TARGET_GROUP_ARN> --health-check-timeout-seconds 10 --health-check-interval-seconds 30 --healthy-threshold-count 3 --unhealthy-threshold-count 2
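The interval and threshold settings together determine how quickly a target changes state: roughly the interval multiplied by the relevant threshold. The snippet below is a sketch of that arithmetic. `health_check_window` is a hypothetical helper, and the estimate ignores the timeout and scheduling jitter.

```bash
#!/usr/bin/env bash
# Hypothetical helper: approximate seconds of consecutive checks needed
# before a target changes state.
# $1 = health check interval in seconds
# $2 = healthy (or unhealthy) threshold count
health_check_window() {
  echo $(( $1 * $2 ))
}

# Example: 30-second interval, 3 consecutive successes required
health_check_window 30 3
```

With these example values a recovering target needs about 90 seconds of passing checks before it receives traffic again, so a longer interval also slows recovery.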

    Step 4: Review and Increase Backend Capacity

    Check CPU/Memory on backend targets. If saturated, scale vertically (instance size) or optimize application performance.

    bash
    # SSH into a backend instance and check resource usage
    ssh -i <KEY_PEM> ec2-user@<INSTANCE_IP> 'top -bn1 | head -20'
    # Check CloudWatch for EC2 CPU utilization
    aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization --dimensions Name=InstanceId,Value=<INSTANCE_ID> --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) --period 300 --statistics Average
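To spot saturation quickly, the CloudWatch datapoints can be filtered against a threshold. This is a minimal sketch: `hot_datapoints` is a hypothetical helper, and it assumes each input line holds a timestamp and an average, as produced by adding `--query 'Datapoints[].[Timestamp,Average]' --output text` to the command above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print datapoints whose average meets or exceeds a threshold.
# $1 = threshold in percent; stdin = lines of "<timestamp> <average>"
hot_datapoints() {
  awk -v limit="$1" '($2 + 0) >= (limit + 0) { print $1, $2 }'
}

# Usage (requires AWS credentials; <INSTANCE_ID> is a placeholder):
# aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization \
#   --dimensions Name=InstanceId,Value=<INSTANCE_ID> \
#   --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
#   --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --period 300 --statistics Average \
#   --query 'Datapoints[].[Timestamp,Average]' --output text | hot_datapoints 90
```

Sustained output from this filter suggests the instance is CPU-bound and that scaling or optimization is warranted.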

    Step 5: Implement Connection Draining and Adjust ALB Timeouts

    Set the deregistration delay (connection draining) on the target group so that in-flight requests can complete while targets are removed during scaling. Increase the ALB idle timeout if clients use long-lived connections.

    bash
    # Set the deregistration delay (connection draining) in seconds; this is a target group attribute
    aws elbv2 modify-target-group-attributes --target-group-arn <TARGET_GROUP_ARN> --attributes Key=deregistration_delay.timeout_seconds,Value=300
    # Raise the ALB idle timeout (a load balancer attribute; the default is 60 seconds)
    aws elbv2 modify-load-balancer-attributes --load-balancer-arn <ALB_ARN> --attributes Key=idle_timeout.timeout_seconds,Value=120

    Step 6: Check Security Group and Network ACL Rules

    Ensure the ALB's security group allows outbound traffic to the targets and the targets' security groups allow inbound traffic from the ALB on the health check and application ports.

    bash
    # Describe security groups for ALB and a target instance
    aws ec2 describe-security-groups --group-ids <ALB_SG_ID> <TARGET_SG_ID>

    Architect's Pro Tip

    "This often happens during sudden traffic spikes when Auto Scaling lags. Pre-warm your ASG by proactively scaling based on predictive metrics (e.g., RequestCountPerTarget) rather than just CPU. Also, ensure your health check endpoint is lightweight and doesn't itself fail under load."

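One way to act on this tip is a target tracking scaling policy driven by the predefined `ALBRequestCountPerTarget` metric. Note that this is target tracking, not the separate predictive scaling feature. The sketch below only generates the policy configuration. `tracking_config`, the policy name, and the target value of 1000 are assumptions to adapt to your workload.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a target tracking configuration for request-based scaling.
# $1 = resource label: app/<ALB_NAME>/<ALB_ID>/targetgroup/<TG_NAME>/<TG_ID>
# $2 = desired average request count per target (assumed value, tune per workload)
tracking_config() {
  cat <<EOF
{
  "TargetValue": $2,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ALBRequestCountPerTarget",
    "ResourceLabel": "$1"
  }
}
EOF
}

# Usage (replace the placeholders; requires AWS credentials):
# tracking_config "app/<ALB_NAME>/<ALB_ID>/targetgroup/<TG_NAME>/<TG_ID>" 1000 > config.json
# aws autoscaling put-scaling-policy --auto-scaling-group-name <ASG_NAME> \
#   --policy-name alb-request-count-tracking --policy-type TargetTrackingScaling \
#   --target-tracking-configuration file://config.json
```

Scaling on request count reacts to load before CPU saturates, which may reduce the lag described above.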
    Frequently Asked Questions

    My targets show as 'healthy' but I still get 503s. What's wrong?

    Targets can pass health checks and still be unable to serve real traffic. Check the ALB's 'TargetConnectionErrorCount' and 'RejectedConnectionCount' metrics. The backend may be unable to accept new connections (e.g., worker threads exhausted or listen queue full) even though it passes a simple health check.
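On a Linux backend, a full listen queue can be checked with `ss -ltn`: for listening sockets, Recv-Q is the current accept-queue length and Send-Q is the configured backlog. The filter below is a sketch under that assumption, and `full_listen_queues` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list listening sockets whose accept queue is at or over its backlog.
# Expects `ss -ltn` output on stdin (columns: State Recv-Q Send-Q Local-Address:Port ...).
full_listen_queues() {
  awk '$1 == "LISTEN" && ($3 + 0) > 0 && ($2 + 0) >= ($3 + 0) { print $4 }'
}

# Usage on a backend instance:
# ss -ltn | full_listen_queues
```

Any address printed here points at a service that is not accepting connections fast enough, regardless of what the health check reports.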

    How do I know if I've hit the target group limits?

    AWS applies adjustable quotas to the number of targets per ALB and rules per ALB. With a very large number of targets (thousands) or complex rules, you may exhaust them. Watch the CloudWatch metrics 'ActiveConnectionCount' and 'TargetConnectionErrorCount', and contact AWS Support to request an increase if necessary.
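The current quota values for an account and Region can be listed with `aws elbv2 describe-account-limits`. The filter below is a small sketch that keeps only the ALB-related entries. `alb_quotas` is a hypothetical helper, and it assumes input lines of the form "name max", as produced by `--query 'Limits[].[Name,Max]' --output text`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep only quotas whose name mentions application load balancers.
# stdin = lines of "<quota-name> <max>"
alb_quotas() {
  awk '$1 ~ /application-load-balancer/ { print $1 ": " $2 }'
}

# Usage (requires AWS credentials):
# aws elbv2 describe-account-limits --query 'Limits[].[Name,Max]' --output text | alb_quotas
```

Compare the reported maximums with the number of targets and rules you actually have registered before requesting an increase.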
