CRITICAL

AWS Application Load Balancer: Fix 503 Service Unavailable due to Target Group Resource Exhaustion

Quick Fix Summary

TL;DR

Restore healthy capacity behind the load balancer immediately: scale out the Auto Scaling group or register additional healthy targets.

The ALB cannot route traffic because the target group has insufficient resources (e.g., no healthy targets, connection limits exceeded, or insufficient capacity).

Diagnosis & Causes

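The target group has no registered targets, or all registered targets are unhealthy or failing health checks.

Backends have reached their own connection or request limits (e.g., worker threads or listen queue) and refuse new connections.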

Backend instances are at 100% CPU/memory, causing health check failures or timeouts.

Recovery Steps

Step 1: Verify Target Health and Load Balancer Metrics

Check the health status of targets and review CloudWatch metrics for the ALB and target group to confirm resource exhaustion.

bash

# Describe target health for the specific target group
aws elbv2 describe-target-health --target-group-arn <TARGET_GROUP_ARN>
# Sum ALB-generated 5XX responses over the last hour in 5-minute buckets (GNU date assumed); rerun with --metric-name TargetConnectionErrorCount to check failed connections to targets
aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB --metric-name HTTPCode_ELB_5XX_Count --dimensions Name=LoadBalancer,Value=<ALB_ARN_Suffix> --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) --period 300 --statistics Sum
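
To turn the target health output into a quick yes/no answer, the response can be filtered with the CLI's `--query` option. The following is a minimal sketch; `healthy_target_count` and `list_unhealthy_targets` are hypothetical helper names, and only `aws elbv2 describe-target-health` is a real CLI call.

```bash
#!/usr/bin/env bash
# Hypothetical helpers wrapping `aws elbv2 describe-target-health`.

# Print the number of targets currently in the "healthy" state.
healthy_target_count() {
  local tg_arn="$1"
  aws elbv2 describe-target-health \
    --target-group-arn "$tg_arn" \
    --query "length(TargetHealthDescriptions[?TargetHealth.State=='healthy'])" \
    --output text
}

# Print ID, state, and reason code for every target that is not healthy.
list_unhealthy_targets() {
  local tg_arn="$1"
  aws elbv2 describe-target-health \
    --target-group-arn "$tg_arn" \
    --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].[Target.Id,TargetHealth.State,TargetHealth.Reason]" \
    --output text
}
```

Calling `healthy_target_count <TARGET_GROUP_ARN>` and getting `0` back would confirm that no target is currently able to receive traffic; the reason codes from `list_unhealthy_targets` (for example `Target.Timeout`) usually point at the next thing to check.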

Step 2: Scale Out Healthy Targets

Increase the number of healthy instances in your Auto Scaling Group (ASG) or manually register new, healthy targets to the group.

bash

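# Set desired capacity for the ASG linked to the target group; --no-honor-cooldown makes the change take effect immediately instead of being rejected while a cooldown is in progress
aws autoscaling set-desired-capacity --auto-scaling-group-name <ASG_NAME> --desired-capacity <NEW_CAPACITY> --no-honor-cooldown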
# Manually register a new EC2 instance to the target group
aws elbv2 register-targets --target-group-arn <TARGET_GROUP_ARN> --targets Id=<INSTANCE_ID>
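
During an incident it can help to choose the new capacity by a fixed rule instead of guessing. The sketch below doubles the current desired capacity and caps it at the group's maximum size; the doubling rule and the function names `next_capacity` and `scale_out_asg` are illustrative assumptions, not AWS recommendations.

```bash
#!/usr/bin/env bash

# Double the current capacity, add at least one instance, never exceed max.
next_capacity() {
  local current="$1" max="$2"
  local wanted=$(( current * 2 ))
  # Doubling 0 stays at 0, so make sure we grow by at least one.
  if [ "$wanted" -le "$current" ]; then wanted=$(( current + 1 )); fi
  if [ "$wanted" -gt "$max" ]; then wanted="$max"; fi
  echo "$wanted"
}

# Read the ASG's current desired capacity and max size, then scale out.
scale_out_asg() {
  local asg_name="$1" current max
  current=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$asg_name" \
    --query 'AutoScalingGroups[0].DesiredCapacity' --output text)
  max=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$asg_name" \
    --query 'AutoScalingGroups[0].MaxSize' --output text)
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name "$asg_name" \
    --desired-capacity "$(next_capacity "$current" "$max")" \
    --no-honor-cooldown
}
```

For a manually registered instance, `aws elbv2 wait target-in-service --target-group-arn <TARGET_GROUP_ARN> --targets Id=<INSTANCE_ID>` blocks until the target passes its health checks, which is a convenient way to confirm the new capacity is actually serving traffic.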

Step 3: Adjust Target Group Health Check Settings

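Temporarily relax the health check so that slow but working targets can pass, but only if the backends can actually handle traffic. Start by increasing the timeout and lowering the healthy threshold, and restore the original settings once the incident is over.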

bash

# Modify health check for the target group (example: longer timeout, fewer consecutive successes required to become healthy)
aws elbv2 modify-target-group --target-group-arn <TARGET_GROUP_ARN> --health-check-timeout-seconds 10 --health-check-interval-seconds 30 --healthy-threshold-count 3 --unhealthy-threshold-count 2
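
The interval and threshold values also determine how quickly a target changes state: roughly the interval multiplied by the relevant threshold. A small worked calculation, using a hypothetical helper name and the values from the command above:

```bash
#!/usr/bin/env bash

# Approximate seconds for a target to change health state:
# health check interval (seconds) * consecutive checks required.
health_transition_seconds() {
  local interval="$1" threshold="$2"
  echo $(( interval * threshold ))
}

health_transition_seconds 30 3   # recovering target marked healthy: prints 90
health_transition_seconds 30 2   # failing target marked unhealthy: prints 60
```

With a healthy threshold of 5 instead of 3, the same 30-second interval would keep a recovered target out of rotation for about 150 seconds, which is why lowering the threshold speeds up recovery.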

Step 4: Review and Increase Backend Capacity

Check CPU/Memory on backend targets. If saturated, scale vertically (instance size) or optimize application performance.

bash

# SSH into a backend instance and check resource usage
ssh -i <KEY_PEM> ec2-user@<INSTANCE_IP> 'top -bn1 | head -20'
# Check CloudWatch for EC2 CPU utilization
aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name CPUUtilization --dimensions Name=InstanceId,Value=<INSTANCE_ID> --start-time $(date -d '1 hour ago' +%s) --end-time $(date +%s) --period 300 --statistics Average
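
Checking instances one at a time is slow when the group is large. The sketch below loops over every target in the group and prints its peak CPU for the last hour. It assumes instance-type targets (so target IDs are EC2 instance IDs) and GNU `date`; the function names are hypothetical.

```bash
#!/usr/bin/env bash

# Print the peak CPUUtilization (percent) for one instance over the last hour.
peak_cpu_last_hour() {
  local instance_id="$1"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=$instance_id" \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 --statistics Maximum \
    --query 'max(Datapoints[].Maximum)' --output text
}

# Print "<instance-id><TAB><peak cpu>" for every target in the target group.
target_group_cpu_report() {
  local tg_arn="$1" id
  for id in $(aws elbv2 describe-target-health \
                --target-group-arn "$tg_arn" \
                --query 'TargetHealthDescriptions[].Target.Id' \
                --output text); do
    printf '%s\t%s\n' "$id" "$(peak_cpu_last_hour "$id")"
  done
}
```

Instances that report values close to 100 are the likely source of slow or failed health checks. Memory is not covered here because EC2 does not publish it to CloudWatch without an agent.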

Step 5: Implement Connection Draining and Adjust ALB Timeouts

Set the deregistration delay (connection draining) on the target group so that in-flight requests can complete when targets are removed during scaling; the default is 300 seconds. Increase the ALB idle timeout (default 60 seconds) if clients use long-lived connections.

bash

# Set the deregistration delay (connection draining); this is a target group attribute, not a modify-target-group option
aws elbv2 modify-target-group-attributes --target-group-arn <TARGET_GROUP_ARN> --attributes Key=deregistration_delay.timeout_seconds,Value=300
# Raise the ALB idle timeout (a load balancer attribute; the default is 60 seconds, 120 is an example value)
aws elbv2 modify-load-balancer-attributes --load-balancer-arn <ALB_ARN> --attributes Key=idle_timeout.timeout_seconds,Value=120
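
After changing attributes it is worth reading them back to confirm what is actually in effect. A minimal sketch, with hypothetical function names wrapping the two real `describe-*-attributes` commands:

```bash
#!/usr/bin/env bash

# Print the current deregistration delay (seconds) of a target group.
get_deregistration_delay() {
  aws elbv2 describe-target-group-attributes \
    --target-group-arn "$1" \
    --query "Attributes[?Key=='deregistration_delay.timeout_seconds'].Value" \
    --output text
}

# Print the current idle timeout (seconds) of a load balancer.
get_idle_timeout() {
  aws elbv2 describe-load-balancer-attributes \
    --load-balancer-arn "$1" \
    --query "Attributes[?Key=='idle_timeout.timeout_seconds'].Value" \
    --output text
}
```

Usage would look like `get_deregistration_delay <TARGET_GROUP_ARN>` and `get_idle_timeout <ALB_ARN>`, each printing a single number.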

Step 6: Check Security Group and Network ACL Rules

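Ensure the ALB's security group allows outbound traffic to the targets and the targets' security groups allow inbound traffic from the ALB on the health check and application ports. If the subnets use custom network ACLs, confirm that they allow the same traffic in both directions, including ephemeral ports for return traffic.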

bash

# Describe security groups for ALB and a target instance
aws ec2 describe-security-groups --group-ids <ALB_SG_ID> <TARGET_SG_ID>
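
Reading raw security group JSON under pressure is error-prone, so the check can be narrowed to the one question that matters. The sketch below lists the port ranges on the target security group that name the ALB's security group as a source, and dumps the network ACL entries for a subnet. The function names are hypothetical, and rules that admit the ALB by CIDR range rather than by security group reference will not appear in the first check.

```bash
#!/usr/bin/env bash

# Print protocol and port range of inbound rules on the target security
# group that reference the ALB security group as their source.
ports_open_to_alb() {
  local target_sg="$1" alb_sg="$2"
  aws ec2 describe-security-groups --group-ids "$target_sg" \
    --query "SecurityGroups[0].IpPermissions[?UserIdGroupPairs[?GroupId=='$alb_sg']].[IpProtocol,FromPort,ToPort]" \
    --output text
}

# Print the network ACL entries associated with a subnet.
subnet_nacl_entries() {
  local subnet_id="$1"
  aws ec2 describe-network-acls \
    --filters "Name=association.subnet-id,Values=$subnet_id" \
    --query 'NetworkAcls[].Entries[].[RuleNumber,Egress,RuleAction,CidrBlock,Protocol]' \
    --output text
}
```

An empty result from `ports_open_to_alb <TARGET_SG_ID> <ALB_SG_ID>` suggests the targets do not accept traffic from the ALB by security group reference, which would explain failing health checks.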

Architect's Pro Tip

"This often happens during sudden traffic spikes when Auto Scaling lags. Scale your ASG ahead of demand by using a request-based metric (e.g., a target tracking policy on ALBRequestCountPerTarget) or predictive scaling rather than CPU alone. Also, ensure your health check endpoint is lightweight and doesn't itself fail under load."

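One way to act on this tip is a target tracking policy on the predefined `ALBRequestCountPerTarget` metric. The sketch below builds the `ResourceLabel` that metric requires from the two ARNs and then creates the policy. The function names, the policy name `alb-request-count-per-target`, and the example target value are assumptions for illustration.

```bash
#!/usr/bin/env bash

# Build "app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>" from the
# load balancer ARN and the target group ARN.
alb_resource_label() {
  local alb_arn="$1" tg_arn="$2"
  echo "${alb_arn#*:loadbalancer/}/targetgroup/${tg_arn#*:targetgroup/}"
}

# Create a target tracking policy that keeps requests per target near
# the given value (policy name is a hypothetical example).
put_request_count_policy() {
  local asg_name="$1" label="$2" target_value="$3"
  aws autoscaling put-scaling-policy \
    --auto-scaling-group-name "$asg_name" \
    --policy-name alb-request-count-per-target \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration \
    "{\"PredefinedMetricSpecification\":{\"PredefinedMetricType\":\"ALBRequestCountPerTarget\",\"ResourceLabel\":\"$label\"},\"TargetValue\":$target_value}"
}
```

A call might look like `put_request_count_policy <ASG_NAME> "$(alb_resource_label <ALB_ARN> <TARGET_GROUP_ARN>)" 1000`, where 1000 requests per target is a placeholder to be replaced with a value taken from load testing.
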
Frequently Asked Questions

My targets show as 'healthy' but I still get 503s. What's wrong?

Targets can pass a simple health check while the backends behind them are saturated. Check the ALB's 'TargetConnectionErrorCount' and 'RejectedConnectionCount' metrics. The backend may be unable to accept new connections (e.g., max threads reached, listen queue full) even though the health check endpoint still responds.
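
To check the connection-related metrics in one pass, a small loop over the metric names can help. This is a sketch that assumes GNU `date`; the function names are hypothetical, while the three metric names are standard `AWS/ApplicationELB` metrics.

```bash
#!/usr/bin/env bash

# Print the sum of one ALB metric over the last hour.
alb_metric_sum() {
  local metric="$1" alb_suffix="$2"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB --metric-name "$metric" \
    --dimensions "Name=LoadBalancer,Value=$alb_suffix" \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 3600 --statistics Sum \
    --query 'sum(Datapoints[].Sum)' --output text
}

# Print "<metric><TAB><sum>" for the metrics most relevant to 503s.
connection_error_report() {
  local alb_suffix="$1" m
  for m in HTTPCode_ELB_503_Count TargetConnectionErrorCount RejectedConnectionCount; do
    printf '%s\t%s\n' "$m" "$(alb_metric_sum "$m" "$alb_suffix")"
  done
}
```

Run it as `connection_error_report <ALB_ARN_Suffix>`. A non-zero `TargetConnectionErrorCount` alongside healthy targets points at backends refusing connections rather than at the load balancer itself.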

How do I know if I've hit the target group limits?

AWS applies adjustable quotas to the number of targets per ALB and rules per ALB. If you register a very large number of targets or maintain many complex rules, you may approach them. Compare your current usage with the values shown in Service Quotas, and request an increase there or through AWS Support if necessary. The CloudWatch metrics 'ActiveConnectionCount' and 'TargetConnectionErrorCount' show connection load, not quota usage.
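
The applied quota values can also be read from the command line and compared against what is registered. The sketch below uses hypothetical function names and assumes the relevant quota names contain the text "Application Load Balancer"; adjust the filter if your account lists them differently.

```bash
#!/usr/bin/env bash

# Print name and applied value of ELB quotas that mention ALBs.
list_alb_quotas() {
  aws service-quotas list-service-quotas \
    --service-code elasticloadbalancing \
    --query "Quotas[?contains(QuotaName, 'Application Load Balancer')].[QuotaName,Value]" \
    --output text
}

# Print how many targets are registered with one target group.
registered_target_count() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query 'length(TargetHealthDescriptions)' \
    --output text
}
```

Summing `registered_target_count` across the load balancer's target groups and comparing the total with the targets-per-ALB value from `list_alb_quotas` gives a rough view of how close you are to the limit.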

Related AWS Guides

How to Fix AWS AccessDeniedException Error

AWS EKS: Fix Intermittent Pod Evictions due to Resource Exhaustion in Multi-Tenant Clusters

Troubleshooting HTTP 502 Errors from ALB Despite Target Health Checks Passing