Root Cause Analysis: Why Alibaba Cloud SLB Health Check Fails
Quick Fix Summary
TL;DR: Verify the backend server is running, the firewall allows the SLB CIDR, and the health check path and port are correct.
A failed health check means the Alibaba Cloud Server Load Balancer (SLB) could not get a successful probe response from one or more backend servers. The SLB stops forwarding traffic to instances it marks unhealthy, which can lead to a service outage if every backend fails the check.
Diagnosis & Causes
Recovery Steps
Step 1: Verify Backend Server Status and Listening Ports
First, SSH into the backend ECS instance and confirm the service is running and bound to the correct IP/port expected by the SLB health check.
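A service bound only to 127.0.0.1 answers a local curl but is unreachable from the SLB, so it helps to look at the bind address and not just the port. The following is a minimal sketch, assuming `ss -tln` style output without a Netid column; `listen_addrs` and `bind_check` are hypothetical helper names, not part of any Alibaba Cloud tooling.

```shell
# Print the local addresses listening on PORT, reading `ss -tln` output on stdin.
listen_addrs() {
  awk -v port="$1" '
    $1 == "LISTEN" {
      for (i = 2; i <= NF; i++) {
        if ($i ~ (":" port "$")) {      # first field ending in :PORT is the local address
          addr = $i
          sub(":" port "$", "", addr)   # strip the port, keep the address
          print addr
          break
        }
      }
    }'
}

# Report whether anything other than loopback is listening on PORT.
bind_check() {
  addrs="$(listen_addrs "$1")"
  if [ -z "$addrs" ]; then
    echo "no listener on port $1"
    return 2
  fi
  # grep -v selects lines that are NOT loopback; success means one exists.
  if printf '%s\n' "$addrs" | grep -Eqv '^(127\.|\[::1\]$)'; then
    echo "port $1 has a non-loopback listener"
  else
    echo "port $1 is bound to loopback only"
    return 1
  fi
}
```

On the backend, `ss -tln | bind_check 8080` would feed the live socket table into the check; a "loopback only" result suggests the service's bind address needs changing before any firewall work.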
# Check if service process is running
pgrep -af '<your_service>'
# Verify service is listening on the expected port (e.g., 8080)
sudo netstat -tlnp | grep :8080
# Test local connectivity
curl -v http://localhost:8080/health
Step 2: Audit Security Group and Instance Firewall Rules
The SLB health check originates from the 100.64.0.0/10 CIDR block. The 100.104.0.0/16 block is often listed alongside it, but it is a subset of 100.64.0.0/10, so a rule for the larger block covers both. You must explicitly allow this range in your backend instance's security group and host firewall (like iptables).
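When a dropped packet shows up in a firewall log, it is useful to confirm whether its source address belongs to the health check range. The sketch below does the CIDR arithmetic in plain shell; `ip_to_int` and `in_cidr` are hypothetical helper names, and the code assumes well-formed IPv4 addresses without leading zeros.

```shell
# Pack a dotted-quad IPv4 address into a single 32-bit integer.
ip_to_int() {
  old_ifs="$IFS"
  IFS=.
  set -- $1            # split the address into four octets
  IFS="$old_ifs"
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if address $1 lies inside CIDR block $2 (e.g. 100.64.0.0/10).
in_cidr() {
  net="${2%/*}"
  bits="${2#*/}"
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}
```

For example, `in_cidr 100.104.5.6 100.64.0.0/10 && echo covered` prints `covered`, which also illustrates that addresses in 100.104.0.0/16 already fall inside 100.64.0.0/10.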
# 1. Check current iptables rules on the backend server
sudo iptables -L -n -v
# 2. Add a rule to allow the SLB health check CIDR (Example for port 8080)
sudo iptables -I INPUT -s 100.64.0.0/10 -p tcp --dport 8080 -j ACCEPT
sudo iptables -I INPUT -s 100.104.0.0/16 -p tcp --dport 8080 -j ACCEPT
# 3. For Alibaba Cloud Security Groups, add rules for source: 100.64.0.0/10 & 100.104.0.0/16 to your backend server's group.
Step 3: Validate and Adjust SLB Health Check Configuration
Log into the Alibaba Cloud Console and meticulously compare your SLB listener's health check settings with your backend application's actual endpoint. A single character mismatch in the path can cause failure.
# Use Alibaba Cloud CLI to check the current health status of the backend servers
aliyun slb DescribeHealthStatus --LoadBalancerId lb-xxx --ListenerPort 80
# Show the listener's current health check settings (HTTP listener)
aliyun slb DescribeLoadBalancerHTTPListenerAttribute --LoadBalancerId lb-xxx --ListenerPort 80
# To update a health check (example: HTTP listener, port 8080, path /api/health)
aliyun slb SetLoadBalancerHTTPListenerAttribute --LoadBalancerId lb-xxx --ListenerPort 80 --HealthCheck on --HealthCheckConnectPort 8080 --HealthCheckDomain 'yourdomain.com' --HealthCheckURI '/api/health' --HealthyThreshold 3 --UnhealthyThreshold 3 --HealthCheckTimeout 5 --HealthCheckInterval 2
Step 4: Simulate the Health Check from a Test Instance
Create a temporary ECS instance in the same VPC to simulate the SLB's health check probe, isolating network path issues.
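A simulated probe is only useful if its result is judged the way the health check would judge it: by status code class and by response time against the timeout. Here is a minimal sketch of that verdict; `probe_verdict` is a hypothetical helper, and the default accepted classes (`http_2xx,http_3xx`) are an assumption that should be replaced with your listener's actual setting.

```shell
# Decide whether a probe result would count as healthy.
#   $1 = HTTP status code
#   $2 = total response time in seconds (may be fractional)
#   $3 = health check timeout in seconds
#   $4 = accepted status classes (ASSUMED default: http_2xx,http_3xx)
probe_verdict() {
  code="$1"
  elapsed="$2"
  timeout="$3"
  accepted="${4:-http_2xx,http_3xx}"
  class="http_${code%??}xx"        # 200 -> http_2xx, 503 -> http_5xx
  case ",$accepted," in
    *",$class,"*) ;;
    *) echo "FAIL: status $code ($class) not in $accepted"; return 1 ;;
  esac
  # Shell arithmetic is integer-only, so compare the fractional time in awk.
  if awk -v e="$elapsed" -v t="$timeout" 'BEGIN { exit !(e < t) }'; then
    echo "PASS: status $code in ${elapsed}s"
  else
    echo "FAIL: took ${elapsed}s, timeout is ${timeout}s"
    return 1
  fi
}
```

The two inputs can come from curl's write-out format, `-s -o /dev/null -w '%{http_code} %{time_total}'`, run from the test instance against the backend's private IP. Keep in mind that the test instance has its own VPC address, so a PASS here validates the application and network path but not the firewall rules for the SLB range.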
# On a test instance in the same VPC, try to reach the backend
# Test TCP connectivity
telnet <backend_private_ip> 8080
# Test HTTP health check
curl -v -H "Host: yourdomain.com" http://<backend_private_ip>:8080/api/health
# Check response time and status code
Step 5: Check for Resource Exhaustion and Kernel Parameters
High load on the backend server can cause connection drops. Check for full connection queues, port exhaustion, or restrictive kernel settings.
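On Linux, `ss` shows the current accept queue length of a listening socket in Recv-Q and its configured backlog in Send-Q, so a full connection queue can be spotted directly. The following sketch flags that condition; `accept_queue_report` is a hypothetical helper, and it assumes `ss -ltn` output with no Netid column (state, Recv-Q, Send-Q, local address in that order).

```shell
# Read `ss -ltn` output on stdin and report accept-queue usage for PORT.
accept_queue_report() {
  awk -v port="$1" '
    $1 == "LISTEN" && $4 ~ (":" port "$") {
      # Recv-Q = connections waiting to be accepted, Send-Q = backlog limit
      state = ($2 + 0 >= $3 + 0) ? "FULL" : "ok"
      printf "%s queued=%d backlog=%d %s\n", $4, $2, $3, state
    }'
}
```

Running `ss -ltn | accept_queue_report 8080` during a health check failure shows whether the listener is saturated. A `FULL` line suggests the application is not accepting connections fast enough or the backlog is too small, which is when raising `net.core.somaxconn` becomes relevant.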
# Check current connection states to the service port
ss -ant sport = :8080
# Check for dropped packets (look for 'drop' or 'overflow')
netstat -s | grep -i listen
# Review kernel parameters related to connection backlog
sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
# Increase backlog if needed (example)
echo 'net.core.somaxconn=1024' | sudo tee -a /etc/sysctl.conf && \
sudo sysctl -p
Step 6: Enable and Analyze SLB Access Logs
For Layer 7 (HTTP/HTTPS) listeners, enable access logging to see the exact HTTP request the SLB is sending and the backend's response.
# 1. Enable access log via CLI (specify a Log Service project and Logstore)
aliyun slb SetAccessLogsDownloadAttribute --LoadBalancerId lb-xxx --LogDownloadEnabled true --LogProject my-project --LogStore my-store
# 2. After a few minutes, query the Logstore and look for health check requests (User-Agent: 'healthcheck') and their HTTP status codes.
Architect's Pro Tip
"For TCP listeners, the SLB opens the health check with a TCP SYN packet and, once the backend answers, closes the connection with a reset instead of sending any application data. If your backend uses a connection-oriented protocol like Redis or MySQL, ensure it tolerates these aborted probe connections rather than treating them as client errors."
Frequently Asked Questions
Can I use a different port for health checks than the listener port?
Yes. The listener's health check configuration accepts a dedicated health check port (HealthCheckConnectPort in the API). If you leave it unset, the probe goes to the backend port the listener forwards to. Pointing the health check at a different port is useful if your service exposes its health endpoint on a separate management port.
Why does health check pass locally but fail from SLB?
This is almost always a network security rule issue. The local test uses localhost or the private IP, bypassing the security group and host firewall. The SLB probe comes from its own CIDR block, which must be explicitly allowed in both the ECS security group and the instance's local firewall (e.g., iptables).
How often do health checks occur, and what do the thresholds mean?
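The 'Interval' (1-50 sec) is the time between checks. 'HealthyThreshold' (2-10) is the number of consecutive successes needed to mark an instance healthy. 'UnhealthyThreshold' (2-10) is the number of consecutive failures needed to mark it unhealthy. A common pitfall is the relationship between timeout and interval: per the API reference, a HealthCheckTimeout shorter than HealthCheckInterval is ignored and the interval is used as the timeout instead.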
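These settings combine into a detection delay, which is worth estimating before tuning them. The sketch below uses a simplified model: it assumes every failing probe runs for the full timeout and that consecutive probes are separated by the interval. `failure_window` is a hypothetical helper, and real detection times may differ.

```shell
# Estimate the worst-case seconds until a backend is marked unhealthy.
#   $1 = health check timeout, $2 = interval, $3 = unhealthy threshold
failure_window() {
  timeout="$1"
  interval="$2"
  threshold="$3"
  # threshold probes that each time out, with (threshold - 1) gaps between them
  echo $(( timeout * threshold + interval * (threshold - 1) ))
}
```

With the example values used earlier (timeout 5, interval 2, unhealthy threshold 3), `failure_window 5 2 3` gives 19 seconds. Under this model, lowering the timeout shortens detection far more than lowering the interval.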