
Root Cause Analysis: Why Alibaba Cloud ACK Pods Get OOMKilled

Quick Fix Summary

TL;DR

Increase pod memory limits and requests, then analyze application memory usage patterns.
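
A quick, imperative way to do the first part (the deployment, namespace, and container names are placeholders for your own):

bash
# Hypothetical example: raise memory requests/limits on an existing Deployment in place.
kubectl set resources deployment/<deploy-name> -n <namespace> \
  --containers=<container-name> \
  --requests=memory=512Mi --limits=memory=1Gi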

OOMKilled occurs when a container exceeds its configured memory limit, triggering the Linux kernel's OOM killer. The kernel kills the offending process to protect node stability; the kubelet then reports the container as OOMKilled (exit code 137) and restarts it according to the pod's restartPolicy.
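
To confirm that this is what happened, check the container's last termination state, which records the reason and exit code:

bash
# An OOM-killed container shows reason "OOMKilled" and exit code 137 here.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'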

Diagnosis & Causes

  • Insufficient memory requests/limits in pod spec.
  • Application memory leak or unbounded growth.
  • JVM heap misconfiguration exceeding container limit.
  • Sidecar containers consuming unexpected memory.
  • Node memory pressure from other pods or system daemons.

Recovery Steps

    Step 1: Diagnose the OOM Event

    First, confirm the OOMKilled status and examine the pod's recent events and resource configuration.

    bash
    kubectl describe pod <pod-name> -n <namespace>
    kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 -B 5 resources
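
    One way to see the surrounding events in order (the OOMKill, back-off, and restarts) is to filter and sort them by time:

    bash
    # Events for this pod only, oldest first.
    kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp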

    Step 2: Analyze Container Memory Usage

    Use kubectl top for a point-in-time reading and ACK's monitoring (CloudMonitor or the Prometheus-based dashboards in the console) for the usage history leading up to the kill, then compare it against the container's requests and limits.

    bash
    kubectl top pod <pod-name> --containers -n <namespace>
    # Check Alibaba Cloud Container Service for Pod/Container metrics
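
    If you want the raw numbers behind kubectl top, the Metrics API can be queried directly; a quick check, assuming the metrics-server component is installed:

    bash
    # Per-container memory usage as reported by metrics-server for this pod.
    kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>"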

    Step 3: Review and Adjust Memory Limits

    Update the pod's memory limits and requests based on observed usage, adding a safety buffer (e.g., 20-30%).

    bash
    # Example deployment patch
    kubectl patch deployment <deploy-name> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"1024Mi"},"requests":{"memory":"512Mi"}}}]}}}}'
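
    After patching, it is worth confirming that the rollout finished and the new values are in effect:

    bash
    # Wait for the replacement pods, then read back the container's resources.
    kubectl rollout status deployment/<deploy-name> -n <namespace>
    kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].resources}'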

    Step 4: Configure JVM Heap for Java Apps

    For JVM-based containers, explicitly set the heap size (-Xmx) well below the container limit, leaving headroom for metaspace, thread stacks, and off-heap buffers, so the JVM throws OutOfMemoryError instead of the whole container being OOM killed.

    yaml
    env:
    - name: JAVA_OPTS  # must be passed to java by your entrypoint; JAVA_TOOL_OPTIONS is picked up by the JVM automatically
      value: "-Xmx700m -XX:+UseContainerSupport" # ~300Mi headroom below a 1Gi limit for metaspace, threads, and off-heap buffers

    Step 5: Implement Liveness & Readiness Probes

    Add a memory-aware liveness probe so the kubelet restarts the container gracefully as it approaches the limit, rather than letting the kernel OOM-kill it mid-request.

    yaml
    livenessProbe:
      exec:
        command:
        - sh
        - -c
        # POSIX test ([ ]) so the check works in dash/busybox sh; this path is cgroup v1
        # (on cgroup v2 nodes read /sys/fs/cgroup/memory.current instead)
        - '[ "$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)" -lt 900000000 ]' # restart before ~900MB of a 1Gi limit

    Step 6: Set Pod Priority and QoS Class

    Give critical pods Guaranteed QoS (CPU and memory requests equal to limits on every container) so they are the last to be evicted under node memory pressure; a PriorityClass adds further protection for the most important workloads.

    yaml
    # In the pod spec, set requests equal to limits (both CPU and memory) on every container
    resources:
      limits:
        cpu: "500m"
        memory: "1Gi"
      requests:
        cpu: "500m"
        memory: "1Gi"

    Step 7: Enable and Monitor ACK Node Autoscaling

    Enable node auto scaling (ACK's node pool auto scaling is based on Cluster Autoscaler) so that pods that cannot be scheduled due to insufficient memory trigger a scale-up instead of adding pressure to already-full nodes.

    bash
    # Auto scaling is enabled per node pool in the Alibaba Cloud console; the autoscaler runs in kube-system.
    # If it publishes its status ConfigMap, you can inspect it with:
    kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
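
    Before relying on scale-up, it is worth checking how much memory headroom existing nodes actually have:

    bash
    # Live node usage (needs metrics-server) and the requests already committed on a node.
    kubectl top nodes
    kubectl describe node <node-name> | grep -A 8 "Allocated resources"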

    Architect's Pro Tip

    "The Linux kernel caches disk pages in free memory. Your app's 'actual usage' might be fine, but a sudden cache flush can cause a sharp RSS spike, triggering OOM. Monitor 'working_set' memory, not just 'usage'."

    Frequently Asked Questions

    My pod has high memory limits but still gets OOMKilled. Why?

    Check which container was actually killed: limits apply per container, so a memory-hungry sidecar may be the victim, and every process inside a container counts toward the same cgroup. Also look for short spikes that fall between monitoring samples, and remember that under severe node-level memory pressure the kernel can kill processes (and the kubelet can evict pods) even when they are below their own limits.

    What's the difference between OOMKilled and Evicted?

    OOMKilled means the Linux kernel's OOM killer terminated a process in the container, typically because the container breached its memory limit. Evicted means the kubelet proactively removed the pod because the node itself came under resource pressure (memory, disk, or PIDs), even though no container limit was exceeded.
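
    The pod status shows which of the two happened (placeholders as above):

    bash
    # An evicted pod reports "Evicted" in status.reason.
    kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.reason}{"\n"}'
    # An OOM-killed container reports "OOMKilled" in its last termination state.
    kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'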

    Should I always set memory requests equal to limits?

    Not necessarily. Setting them equal (which, together with matching CPU settings, gives the pod Guaranteed QoS) provides the highest stability but reduces bin-packing efficiency. Use it for critical, memory-sensitive apps.
