Root Cause Analysis: Why Alibaba Cloud SLB Health Check Fails

Quick Fix Summary

TL;DR

Verify that the backend server is running, that the firewall allows the SLB health check CIDR, and that the health check path and port are correct.

A failed health check means the Alibaba Cloud Server Load Balancer (SLB) could not get a successful probe response from a backend server. The SLB stops forwarding traffic to instances it marks unhealthy, which can reduce capacity or cause a service outage.

Diagnosis & Causes

  • Backend server process not listening on health check port.
  • Security group or instance firewall blocking SLB probe IPs.
  • Health check configuration mismatch (path, port, protocol).
  • High backend server load causing timeout or TCP reset.
  • Network ACL or VPC route table misconfiguration.

Recovery Steps

    Step 1: Verify Backend Server Status and Listening Ports

    First, SSH into the backend ECS instance and confirm the service is running and bound to the correct IP/port expected by the SLB health check.

    bash
    # Check if the service process is running (replace <your_service> with the process name)
    pgrep -af '<your_service>'
    # Verify service is listening on the expected port (e.g., 8080)
    sudo netstat -tlnp | grep :8080
    # Test local connectivity
    curl -v http://localhost:8080/health
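
A service that listens only on the loopback address passes the local curl test but is unreachable for the SLB. The following is a minimal sketch that classifies how a port is bound from `ss -tln` style output; `classify_bind` is a hypothetical helper name, and the column layout (local address in the fourth field) is an assumption about the `ss` output format.

```shell
# classify_bind PORT: read `ss -tln` lines on stdin and print how PORT is bound:
# "reachable", "loopback-only", or "not-listening". Hypothetical helper.
classify_bind() {
  local port="$1" result="not-listening" state recvq sendq addr rest
  while read -r state recvq sendq addr rest; do
    case "$addr" in
      127.*:"$port"|"[::1]:$port")
        # Loopback listener: only counts if nothing better has been seen.
        if [ "$result" = "not-listening" ]; then result="loopback-only"; fi ;;
      *:"$port")
        # Wildcard or specific non-loopback address.
        result="reachable" ;;
    esac
  done
  echo "$result"
}
```

Usage on the backend would look like `ss -tln | classify_bind 8080`; a result of `loopback-only` suggests changing the service's bind address rather than the firewall.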

    Step 2: Audit Security Group and Instance Firewall Rules

    SLB health check probes originate from the 100.64.0.0/10 range; 100.104.0.0/16, often listed separately, is already contained in it. Allow this range on the health check port in your backend instance's security group and host firewall (such as iptables).

    bash
    # 1. Check current iptables rules on the backend server
    sudo iptables -L -n -v
    # 2. Add a rule to allow the SLB health check CIDR (example for port 8080).
    #    100.104.0.0/16 lies inside 100.64.0.0/10, so this one rule covers both ranges.
    sudo iptables -I INPUT -s 100.64.0.0/10 -p tcp --dport 8080 -j ACCEPT
    # 3. For Alibaba Cloud security groups, add an inbound rule with source 100.64.0.0/10 to your backend server's group.
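
When reading firewall logs or packet captures, it helps to confirm whether a source address really belongs to the probe range. Below is a small sketch of a CIDR membership test in plain shell arithmetic; `ip_to_int` and `in_cidr` are hypothetical helper names, and the code assumes well-formed IPv4 input.

```shell
# ip_to_int A.B.C.D: print the IPv4 address as a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1                       # split the address on dots into $1..$4
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_cidr IP CIDR: succeed (exit 0) if IP lies inside CIDR, e.g. 100.64.0.0/10.
in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")      # network part before the slash
  bits=${2#*/}                    # prefix length after the slash
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $ip & $mask )) -eq $(( $net & $mask )) ]
}
```

For example, `in_cidr 100.104.12.7 100.64.0.0/10 && echo "SLB probe range"` prints the message, which also shows why a separate rule for 100.104.0.0/16 adds nothing.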

    Step 3: Validate and Adjust SLB Health Check Configuration

    Log into the Alibaba Cloud Console and meticulously compare your SLB listener's health check settings with your backend application's actual endpoint. A single character mismatch in the path can cause failure.

    bash
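    # Check the current health status of each backend behind the listener
    aliyun slb DescribeHealthStatus --LoadBalancerId lb-xxx --ListenerPort 80
    # Show the listener's current health check settings (HTTP listener)
    aliyun slb DescribeLoadBalancerHTTPListenerAttribute --LoadBalancerId lb-xxx --ListenerPort 80
    # To update the health check (example: HTTP, port 8080, path /api/health).
    # On classic SLB the health check is part of the listener attributes; run
    # `aliyun slb SetLoadBalancerHTTPListenerAttribute help` to confirm the flags for your CLI version.
    aliyun slb SetLoadBalancerHTTPListenerAttribute --LoadBalancerId lb-xxx --ListenerPort 80 --HealthCheck on --HealthCheckConnectPort 8080 --HealthCheckDomain 'yourdomain.com' --HealthCheckURI '/api/health' --HealthyThreshold 3 --UnhealthyThreshold 3 --HealthCheckTimeout 5 --HealthCheckInterval 2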
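
For HTTP listeners, a probe counts as successful only when the response status falls in an accepted class. The sketch below checks a status code against a comma-separated class list in the `http_2xx,http_3xx` style; it assumes that format matches the listener's `HealthCheckHttpCode` setting, and `code_is_healthy` is a hypothetical helper name.

```shell
# code_is_healthy STATUS ALLOWED: succeed if the 3-digit HTTP STATUS falls in
# one of the comma-separated classes in ALLOWED (e.g. "http_2xx,http_3xx").
code_is_healthy() {
  local status="$1" allowed="$2" class
  class="http_${status%??}xx"      # 200 -> http_2xx, 404 -> http_4xx
  case ",$allowed," in
    *",$class,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Possible usage against a backend (placeholders as in the steps above):
#   code=$(curl -s -o /dev/null -w '%{http_code}' -H 'Host: yourdomain.com' \
#          "http://<backend_private_ip>:8080/api/health")
#   code_is_healthy "$code" "http_2xx,http_3xx" && echo pass || echo "fail ($code)"
```

A redirect (301/302) from the health path is a common reason a check fails when only the 2xx class is accepted.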

    Step 4: Simulate the Health Check from a Test Instance

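    Create a temporary ECS instance in the same VPC and probe the backend the way the SLB would. This isolates application and network path problems. Keep in mind that the test instance's source IP lies outside the SLB probe range, so a pass here does not prove that the firewall admits 100.64.0.0/10.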

    bash
    # On a test instance in the same VPC, try to reach the backend
    # Test TCP connectivity
    telnet <backend_private_ip> 8080
    # Test HTTP health check
    curl -v -H "Host: yourdomain.com" http://<backend_private_ip>:8080/api/health
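    # Check response time and status code
    curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' -H "Host: yourdomain.com" http://<backend_private_ip>:8080/api/health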
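
A single probe shows one result, whereas the SLB changes a backend's state only after a run of consecutive successes or failures. The following sketch replays a sequence of probe outcomes against the two thresholds; `simulate_health` is a hypothetical helper, and starting in the healthy state is an assumption made for illustration.

```shell
# simulate_health UNHEALTHY_THRESHOLD HEALTHY_THRESHOLD RESULT...
# Each RESULT is "ok" or "fail". Starts healthy (assumption) and prints the
# state after the last probe.
simulate_health() {
  local down_after="$1" up_after="$2" state="healthy" ok=0 fail=0 r
  shift 2
  for r in "$@"; do
    if [ "$r" = "ok" ]; then
      ok=$((ok + 1)); fail=0       # a success resets the failure streak
    else
      fail=$((fail + 1)); ok=0     # a failure resets the success streak
    fi
    if [ "$state" = "healthy" ] && [ "$fail" -ge "$down_after" ]; then
      state="unhealthy"
    elif [ "$state" = "unhealthy" ] && [ "$ok" -ge "$up_after" ]; then
      state="healthy"
    fi
  done
  echo "$state"
}
```

With both thresholds at 3, `simulate_health 3 3 ok fail fail ok fail` prints `healthy`, because intermittent failures that never occur three times in a row do not flip the state.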

    Step 5: Check for Resource Exhaustion and Kernel Parameters

    High load on the backend server can cause connection drops. Check for full connection queues, port exhaustion, or restrictive kernel settings.

    bash
    # Check current connection states to the service port
    ss -ant sport = :8080
    # Check listen-queue counters (look for 'overflowed' or 'dropped' and whether they keep increasing)
    netstat -s | grep -i listen
    # Review kernel parameters related to connection backlog
    sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
    # Increase backlog if needed (example)
    echo 'net.core.somaxconn=1024' | sudo tee -a /etc/sysctl.conf && \
      sudo sysctl -p
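
On Linux, `ss -ltn` reports the current accept-queue length of a listening socket in the Recv-Q column and its limit in Send-Q, so a full queue can be spotted directly. Here is a minimal sketch that flags this condition from `ss -ltn` style input; `backlog_status` is a hypothetical helper, and the column positions are an assumption about the `ss` output format.

```shell
# backlog_status PORT: read `ss -ltn` lines on stdin. For the listener on PORT,
# print "full" if the accept queue has reached its limit, "ok" if it has not,
# or "no-listener" if no listening socket on PORT was found.
backlog_status() {
  local port="$1" found=0 state recvq sendq addr rest
  while read -r state recvq sendq addr rest; do
    case "$addr" in
      *:"$port")
        if [ "$state" = "LISTEN" ]; then
          found=1
          # Recv-Q = queued connections, Send-Q = backlog limit.
          if [ "$recvq" -ge "$sendq" ]; then
            echo "full"; return 0
          fi
        fi ;;
    esac
  done
  if [ "$found" -eq 1 ]; then echo "ok"; else echo "no-listener"; fi
}
```

Running `ss -ltn | backlog_status 8080` during a load spike and getting `full` suggests that health check connections are being dropped at the accept queue, which is when raising the backlog settings above is worth trying.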

    Step 6: Enable and Analyze SLB Access Logs

    For Layer 7 (HTTP/HTTPS) listeners, enable access logging to see the exact HTTP request the SLB is sending and the backend's response.

    bash
    # 1. Enable access logs via CLI (the flags below name a Log Service project and Logstore; confirm them with `aliyun slb SetAccessLogsDownloadAttribute help`)
    aliyun slb SetAccessLogsDownloadAttribute --LoadBalancerId lb-xxx --LogDownloadEnabled true --LogProject my-project --LogStore my-store
    # 2. After a few minutes, query the Logstore and look for health check requests (User-Agent: 'healthcheck') and their HTTP status codes.

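    Architect's Pro Tip

    "For TCP listeners, the SLB health check is a bare TCP handshake: the SLB sends a SYN, waits for the SYN-ACK, and then tears the connection down with a RST without sending any application data. Backends such as Redis or MySQL may log these probes as aborted connections, so make sure they are not counted as connection errors that end up blocking the probe source."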

    Frequently Asked Questions

    Can I use a different port for health checks than the listener port?

    Yes. By default the health check uses the backend server port, but the listener's health check configuration lets you set a different one (HealthCheckConnectPort in the API) for both Layer 4 (TCP/UDP) and Layer 7 (HTTP/HTTPS) listeners. This is useful if your service exposes its health endpoint on a separate management port.

    Why does health check pass locally but fail from SLB?

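    This is almost always a network security rule or bind address issue. A local test against localhost never crosses the security group or host firewall, and it still succeeds when the service listens only on 127.0.0.1. The SLB probe arrives over the network from its own CIDR block (100.64.0.0/10), so that range must be allowed in both the ECS security group and the instance's local firewall (e.g., iptables), and the service must listen on the instance's private IP.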

    How often do health checks occur, and what do the thresholds mean?

    'Interval' (1-50 seconds) is the time between checks. 'HealthyThreshold' (2-10) is the number of consecutive successes needed to mark an instance healthy, and 'UnhealthyThreshold' (2-10) is the number of consecutive failures needed to mark it unhealthy. Mind how timeout and interval interact: on classic SLB, a timeout shorter than the interval is ignored and the interval is used as the timeout instead.
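
To see how these settings translate into detection time, here is a rough worst-case estimate. It assumes every failed probe waits out the full timeout and that consecutive probes are separated by one interval; `failure_window` is a hypothetical helper name.

```shell
# failure_window TIMEOUT INTERVAL UNHEALTHY_THRESHOLD: approximate worst-case
# seconds until a dead backend is marked unhealthy, under the assumptions above:
#   TIMEOUT * THRESHOLD + INTERVAL * (THRESHOLD - 1)
failure_window() {
  echo $(( $1 * $3 + $2 * ($3 - 1) ))
}

failure_window 5 2 3   # 5*3 + 2*2 = 19
```

With a 5-second timeout, a 2-second interval, and an unhealthy threshold of 3, a silently dead backend can keep receiving traffic for roughly 19 seconds; lowering the timeout usually shortens this more than lowering the interval.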
