ERROR

Alibaba Cloud RDS: Fix Connection Pool Exhaustion Triggering High-Latency Alerts

Quick Fix Summary

TL;DR

Restart the application to clear stuck connections and temporarily relieve pool pressure.

The application's database connection pool has reached its maximum limit, causing new requests to wait or fail, which manifests as high latency and timeouts.

Diagnosis & Causes

  • Application not closing database connections properly (leaks).
  • Connection pool size is configured too low for the current workload.
  • Recovery Steps

    1

    Step 1: Verify Active Connections and Pool Exhaustion

    Check the current number of active connections on the RDS instance and correlate with application metrics to confirm pool exhaustion.

    bash
    # Check active connections on the RDS instance via DMS or CLI
    aliyun rds DescribeDBInstancePerformance --DBInstanceId <your-instance-id> --Key 'MySQL_ActiveSessions' --StartTime $(date -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ) --EndTime $(date +%Y-%m-%dT%H:%M:%SZ)
    # Check CloudMonitor for 'Database Connections' metric and 'Failed to Get Connection' alerts.
    2

    Step 2: Restart Application to Clear Stale Connections

    Immediate recovery step to kill all existing connections and restart the application pool. Use your orchestration tool.

    bash
    # For ECS instances running the app
    sudo systemctl restart <your-application-service>
    # For Kubernetes deployments
    kubectl rollout restart deployment/<your-app-deployment> -n <namespace>
    3

    Step 3: Increase RDS Maximum Connections Limit

    Temporarily increase the instance's `max_connections` parameter to provide immediate headroom while you fix the root cause.

    bash
    # Modify the RDS parameter group (replace with your group ID and instance ID)
    aliyun rds ModifyParameter --ParameterGroupId <pg-id> --Parameters '[{"Name":"max_connections", "Value":"1000"}]'
    # Apply the parameter group to the instance
    aliyun rds ModifyDBInstanceParameter --DBInstanceId <your-instance-id> --Parameters '[{"Name":"max_connections", "Value":"1000"}]' --Forcerestart false
    4

    Step 4: Kill Long-Running or Idle Connections

    Manually terminate problematic connections consuming pool slots using the RDS DMS or SQL commands.

    sql
    # Connect to the RDS instance via DMS or client, then run:
    SHOW PROCESSLIST;
    -- Identify idle/long-running connections and kill them
    KILL <process_id>;
    5

    Step 5: Optimize Application Connection Pool Settings

    Review and adjust the application's connection pool configuration (e.g., HikariCP, Druid) to better match the RDS `max_connections` limit.

    yaml
    # Example: Check application config for pool settings (Spring Boot HikariCP)
    # application.yml snippet:
    spring:
      datasource:
        hikari:
          maximum-pool-size: 20 # Adjust based on RDS max_connections and app instances
          connection-timeout: 30000
          idle-timeout: 600000
          max-lifetime: 1800000
    6

    Step 6: Scale the RDS Instance

    If the workload has permanently increased, scale the instance to a higher specification (CPU/Memory) which supports more connections.

    bash
    # Upgrade the DB instance specification (causes a momentary connection drop)
    aliyun rds ModifyDBInstanceSpec --DBInstanceId <your-instance-id> --DBInstanceClass '<new-spec>' --PayType Postpaid
    7

    Step 7: Implement Connection Health Checks & Timeouts

    Configure the connection pool to validate connections before use and evict idle connections to prevent leaks from affecting the pool.

    properties
    # Example HikariCP health check properties
    spring.datasource.hikari.connection-test-query=SELECT 1
    spring.datasource.hikari.validation-timeout=5000

    Architect's Pro Tip

    "This often happens after a deployment where a code change introduced a connection leak, or when a downstream service slowdown causes transactions to hold connections open for much longer than usual. Always compare the deployment timeline with the connection growth chart in CloudMonitor."

    Frequently Asked Questions

    Will increasing `max_connections` on RDS impact performance?

    Yes, increasing it too high without sufficient instance resources (CPU/RAM) can lead to contention and degrade performance. It's a temporary buffer, not a solution. Always scale the instance spec if you need a permanently higher limit.

    How do I find the root cause of connection leaks?

    Enable detailed logging for your connection pool (e.g., HikariCP leak detection). Use APM tools to trace slow database queries. Review recent code changes, especially around transaction management and connection handling in try-catch-finally blocks.

    Related Alibaba Cloud Guides