
Root Cause Analysis: Why Alibaba Cloud ACK Pods Get OOMKilled

Quick Fix Summary

TL;DR

Increase pod memory limits and requests, then analyze application memory usage patterns.
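
A quick, imperative way to do the first part (the deployment, namespace, and container names are placeholders for your own):

bash
# Hypothetical example: raise memory requests/limits on an existing Deployment in place.
kubectl set resources deployment/<deploy-name> -n <namespace> \
  --containers=<container-name> \
  --requests=memory=512Mi --limits=memory=1Gi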

OOMKilled occurs when a container exceeds its configured memory limit, triggering the Linux kernel's OOM killer. The kernel kills the offending process to protect node stability; the kubelet then reports the container as OOMKilled (exit code 137) and restarts it according to the pod's restartPolicy.
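
To confirm that this is what happened, check the container's last termination state, which records the reason and exit code:

bash
# An OOM-killed container shows reason "OOMKilled" and exit code 137 here.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'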

Diagnosis & Causes

  • Insufficient memory requests/limits in pod spec.
  • Application memory leak or unbounded growth.
  • JVM heap misconfiguration exceeding container limit.
  • Sidecar containers consuming unexpected memory.
  • Node memory pressure from other pods or system daemons.

Recovery Steps

    Step 1: Diagnose the OOM Event

    First, confirm the OOMKilled status and examine the pod's recent events and resource configuration.

    bash
    kubectl describe pod <pod-name> -n <namespace>
    kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 -B 5 resources
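
    One way to see the surrounding events in order (the OOMKill, back-off, and restarts) is to filter and sort them by time:

    bash
    # Events for this pod only, oldest first.
    kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp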

    Step 2: Analyze Container Memory Usage

    Use kubectl top for a point-in-time reading and ACK's monitoring (CloudMonitor or the Prometheus-based dashboards in the console) for the usage history leading up to the kill, then compare it against the container's requests and limits.

    bash
    kubectl top pod <pod-name> --containers -n <namespace>
    # Check Alibaba Cloud Container Service for Pod/Container metrics
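
    If you want the raw numbers behind kubectl top, the Metrics API can be queried directly; a quick check, assuming the metrics-server component is installed:

    bash
    # Per-container memory usage as reported by metrics-server for this pod.
    kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>"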

    Step 3: Review and Adjust Memory Limits

    Update the pod's memory limits and requests based on observed usage, adding a safety buffer (e.g., 20-30%).

    bash
    # Example deployment patch
    kubectl patch deployment <deploy-name> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"1024Mi"},"requests":{"memory":"512Mi"}}}]}}}}'
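
    After patching, it is worth confirming that the rollout finished and the new values are in effect:

    bash
    # Wait for the replacement pods, then read back the container's resources.
    kubectl rollout status deployment/<deploy-name> -n <namespace>
    kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].resources}'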

    Step 4: Configure JVM Heap for Java Apps

    For JVM-based containers, explicitly set the heap size (-Xmx) well below the container limit, leaving headroom for metaspace, thread stacks, and off-heap buffers, so the JVM throws OutOfMemoryError instead of the whole container being OOM killed.

    yaml
    env:
    - name: JAVA_OPTS  # must be passed to java by your entrypoint; JAVA_TOOL_OPTIONS is picked up by the JVM automatically
      value: "-Xmx700m -XX:+UseContainerSupport" # ~300Mi headroom below a 1Gi limit for metaspace, threads, and off-heap buffers

    Step 5: Implement Liveness & Readiness Probes

    Add a memory-aware liveness probe so the kubelet restarts the container gracefully as it approaches the limit, rather than letting the kernel OOM-kill it mid-request.

    yaml
    livenessProbe:
      exec:
        command:
        - sh
        - -c
        # POSIX test ([ ]) so the check works in dash/busybox sh; this path is cgroup v1
        # (on cgroup v2 nodes read /sys/fs/cgroup/memory.current instead)
        - '[ "$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)" -lt 900000000 ]' # restart before ~900MB of a 1Gi limit

    Step 6: Set Pod Priority and QoS Class

    Give critical pods Guaranteed QoS (CPU and memory requests equal to limits on every container) so they are the last to be evicted under node memory pressure; a PriorityClass adds further protection for the most important workloads.

    yaml
    # In the pod spec, set requests equal to limits (both CPU and memory) on every container
    resources:
      limits:
        cpu: "500m"
        memory: "1Gi"
      requests:
        cpu: "500m"
        memory: "1Gi"

    Step 7: Enable and Monitor ACK Node Autoscaling

    Enable node auto scaling (ACK's node pool auto scaling is based on Cluster Autoscaler) so that pods that cannot be scheduled due to insufficient memory trigger a scale-up instead of adding pressure to already-full nodes.

    bash
    # Auto scaling is enabled per node pool in the Alibaba Cloud console; the autoscaler runs in kube-system.
    # If it publishes its status ConfigMap, you can inspect it with:
    kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
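
    Before relying on scale-up, it is worth checking how much memory headroom existing nodes actually have:

    bash
    # Live node usage (needs metrics-server) and the requests already committed on a node.
    kubectl top nodes
    kubectl describe node <node-name> | grep -A 8 "Allocated resources"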

    Architect's Pro Tip

    "The Linux kernel caches disk pages in free memory. Your app's 'actual usage' might be fine, but a sudden cache flush can cause a sharp RSS spike, triggering OOM. Monitor 'working_set' memory, not just 'usage'."

    Frequently Asked Questions

    My pod has high memory limits but still gets OOMKilled. Why?

    Check which container was actually killed: limits apply per container, so a memory-hungry sidecar may be the victim, and every process inside a container counts toward the same cgroup. Also look for short spikes that fall between monitoring samples, and remember that under severe node-level memory pressure the kernel can kill processes (and the kubelet can evict pods) even when they are below their own limits.

    What's the difference between OOMKilled and Evicted?

    OOMKilled means the Linux kernel's OOM killer terminated a process in the container, typically because the container breached its memory limit. Evicted means the kubelet proactively removed the pod because the node itself came under resource pressure (memory, disk, or PIDs), even though no container limit was exceeded.
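
    The pod status shows which of the two happened (placeholders as above):

    bash
    # An evicted pod reports "Evicted" in status.reason.
    kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.reason}{"\n"}'
    # An OOM-killed container reports "OOMKilled" in its last termination state.
    kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'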

    Should I always set memory requests equal to limits?

    Not necessarily. Setting them equal (which, together with matching CPU settings, gives the pod Guaranteed QoS) provides the highest stability but reduces bin-packing efficiency. Use it for critical, memory-sensitive apps.
