Root Cause Analysis: Why Alibaba Cloud ACK Pods Get OOMKilled
Quick Fix Summary
TL;DR: Increase pod memory limits and requests, then analyze application memory usage patterns.
OOMKilled occurs when a container exceeds its configured memory limit, triggering the Linux kernel's OOM killer to terminate the offending process. The kubelet then reports the container as OOMKilled and restarts it according to the pod's restart policy, protecting node stability and the other workloads on the node.
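A quick way to confirm the kill is to read the container's last termination state, which records the reason and exit code (137 for an OOM kill); this uses only standard kubectl JSONPath output with placeholder names.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
# Prints "OOMKilled" when the previous restart was caused by the OOM killer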
Diagnosis & Causes
Recovery Steps
Step 1: Diagnose the OOM Event
First, confirm the OOMKilled status and examine the pod's recent events and resource configuration.
kubectl describe pod <pod-name> -n <namespace>
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 -B 5 resources
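If you have SSH access to the affected ACK node (or can start a node debug session), the kernel log shows exactly which process the OOM killer chose; a minimal sketch assuming standard Linux tooling on the node, with placeholder names.
# Run on the node, e.g. via SSH or: kubectl debug node/<node-name> -it --image=<debug-image>
dmesg -T | grep -i -E "oom-killer|killed process"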
Step 2: Analyze Container Memory Usage
Use kubectl top and ACK's monitoring to see actual vs. requested memory consumption before the kill.
kubectl top pod <pod-name> --containers -n <namespace>
# Check Alibaba Cloud Container Service monitoring for Pod/Container memory metrics
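If the cluster also ships metrics to a Prometheus-compatible backend (for example ACK's Managed Service for Prometheus), the peak working set over a recent window is a better sizing basis than a single kubectl top sample; a sketch using the standard cAdvisor metric, with placeholder labels.
# PromQL: highest working set per container over the last hour
max_over_time(container_memory_working_set_bytes{pod="<pod-name>", container!=""}[1h])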
Step 3: Review and Adjust Memory Limits
Update the pod's memory limits and requests based on observed usage, adding a safety buffer (e.g., 20-30%).
# Example deployment patch
kubectl patch deployment <deploy-name> -n <namespace> -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container-name>","resources":{"limits":{"memory":"1024Mi"},"requests":{"memory":"512Mi"}}}]}}}}'
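After patching, it is worth confirming that the rollout finished and the new values are live; both commands below are standard kubectl with placeholder names.
kubectl rollout status deployment/<deploy-name> -n <namespace>
# The rollout creates new pods; check the limits on one of them
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'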
Step 4: Configure JVM Heap for Java Apps
For JVM-based containers, explicitly set the heap size (-Xmx) below the container limit to prevent direct OOM kills.
env:
  - name: JAVA_OPTS
    value: "-Xmx700m -XX:+UseContainerSupport" # For a 1Gi limit container
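Note that JAVA_OPTS only takes effect if the image's entrypoint passes it to the JVM. An alternative sketch uses JAVA_TOOL_OPTIONS, which recent JVMs (8u191+ and 11+) read automatically, and sizes the heap as a fraction of the container limit instead of a fixed -Xmx; the 70% figure is an example, not a tuned value.
env:
  - name: JAVA_TOOL_OPTIONS                # Read directly by the JVM at startup
    value: "-XX:MaxRAMPercentage=70.0"     # Heap capped at roughly 70% of the container memory limit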
Step 5: Implement Liveness & Readiness Probes
Add memory-sensitive health checks to allow Kubernetes to restart unhealthy pods before a system OOM kill.
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - '[ $(cat /sys/fs/cgroup/memory/memory.usage_in_bytes) -lt 900000000 ]' # Check against ~900MB (cgroup v1 path)
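The probe above assumes cgroup v1. On nodes running cgroup v2 (the default on newer OS images), the accounting file is different; a variant assuming the runtime exposes the unified hierarchy at /sys/fs/cgroup inside the container.
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - '[ $(cat /sys/fs/cgroup/memory.current) -lt 900000000 ]' # cgroup v2 equivalent of the check above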
Step 6: Set Pod Priority and QoS Class
Assign the Guaranteed QoS class (CPU and memory requests equal to limits for every container) to protect critical pods from eviction under node memory pressure.
# In the pod spec, set equal memory requests and limits
resources:
  limits:
    memory: "1Gi"
  requests:
    memory: "1Gi"
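For the priority half of this step, a PriorityClass referenced via priorityClassName keeps critical pods from being preempted before less important workloads; a minimal sketch in which the class name and value are illustrative, not ACK defaults.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-memory-workload   # Example name
value: 100000                      # Higher value = scheduled and retained ahead of lower classes
globalDefault: false
description: "Critical, memory-sensitive workloads"
Reference it from the pod spec with priorityClassName: critical-memory-workload.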
Step 7: Enable and Monitor ACK Node Autoscaling
Enable the Cluster Autoscaler so that new nodes are added when pods cannot be scheduled on the existing nodes for lack of memory.
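Since scale-up is triggered by unschedulable pods rather than by memory utilization, the signals worth watching are pending pods and per-node headroom; two standard checks with plain kubectl, nothing ACK-specific.
# Pods the scheduler cannot place (the condition that triggers a scale-up)
kubectl get pods -A --field-selector=status.phase=Pending
# Current memory usage and headroom per node
kubectl top nodes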
# Ensure the cluster-autoscaler add-on is installed; it adds nodes when pods are unschedulable due to insufficient memory
# Check its status via the Alibaba Cloud console (node pool auto scaling) or:
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
Architect's Pro Tip
"The Linux kernel counts page cache toward a container's memory usage, so raw 'usage' can look high even when the application is healthy. The kernel reclaims that cache before OOM-killing, and the kubelet's eviction decisions use the working set, so monitor 'working_set' memory, not just 'usage'."
Frequently Asked Questions
My pod has high memory limits but still gets OOMKilled. Why?
Check for memory fragmentation, memory-hungry sidecar containers, or a node-level OOM in which the kernel kills a process because the whole node is under memory pressure, not because your container breached its own limit.
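A quick way to tell whether the node rather than the container was the trigger is to check the node's memory condition and current headroom; both commands are standard kubectl with placeholder names.
kubectl describe node <node-name> | grep -i -A 2 "MemoryPressure"
kubectl top node <node-name>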
What's the difference between OOMKilled and Evicted?
OOMKilled is a hard kill by the Linux kernel's OOM killer, typically because the container breached its memory limit. Evicted is a graceful termination by the kubelet when the node itself comes under memory (or disk) pressure and needs to reclaim resources.
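The two also surface differently in the API, which makes them easy to tell apart after the fact; both checks below are standard kubectl.
# Evictions are recorded as pod events with reason=Evicted
kubectl get events -A --field-selector reason=Evicted
# OOM kills appear in the container's last termination state
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 "Last State"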
Should I always set memory requests equal to limits?
Not necessarily. Setting them equal (Guaranteed QoS) provides the highest stability but reduces bin-packing efficiency. Use it for critical, memory-sensitive apps.
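To verify which QoS class Kubernetes actually assigned after you set requests and limits, read it straight from the pod status.
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.qosClass}'
# Prints Guaranteed, Burstable, or BestEffort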