How to Fix Alibaba Cloud Node NotReady
Quick Fix Summary
TL;DR: Restart the kubelet service and check node resource constraints.
A Kubernetes node enters the NotReady state when the kubelet cannot report its status to the control plane. This indicates a critical failure preventing pods from being scheduled or running on that node.
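Before touching the node itself, confirm which nodes are affected and when each condition last changed; a quick triage sketch, run from any machine with cluster access:
# List every node with status, age, and kubelet version
kubectl get nodes -o wide
# Print each condition with the time it last transitioned
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.lastTransitionTime}{"\n"}{end}'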
Recovery Steps
Step 1: Diagnose Node Status & Kubelet Health
First, get the detailed node condition and check if the kubelet process is running. This identifies if the issue is local to the node.
kubectl describe node <node-name> | grep -A 10 Conditions
ssh <node-ip> systemctl status kubelet --no-pager
ssh <node-ip> 'journalctl -u kubelet --since "5 minutes ago" --no-pager'
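If the journal is noisy, filtering for common failure signatures narrows things down quickly; a sketch (the grep patterns are illustrative, not exhaustive):
# Surface likely culprits such as PLEG stalls, eviction pressure, or runtime timeouts
ssh <node-ip> 'journalctl -u kubelet --since "30 minutes ago" --no-pager' | grep -iE 'error|failed|pleg|pressure|timeout' | tail -n 20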
Step 2: Restart Kubelet and Container Runtime
If the kubelet is frozen or the container runtime is hung, a restart is the fastest path to recovery. Always check the logs after restarting.
ssh <node-ip> systemctl restart containerd
ssh <node-ip> systemctl restart kubelet
sleep 30 && kubectl get node <node-name>
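Rather than polling by hand, you can watch the node re-register and confirm the kubelet is posting fresh heartbeats:
# Watch the node object until STATUS flips to Ready (Ctrl-C to stop)
kubectl get node <node-name> -w
# A healthy kubelet renews its lease roughly every 10 seconds
kubectl get lease -n kube-node-lease <node-name>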
Step 3: Check for Critical Resource Pressure
Disk pressure (especially on /var) and memory exhaustion are common silent killers. SSH into the node to inspect.
ssh <node-ip> df -h /var/lib/containerd /var/lib/kubelet
ssh <node-ip> free -h
ssh <node-ip> cat /proc/pressure/memory  # PSI memory stats (kernel 4.20+); the cgroup v1 memory.pressure_level file is not readable with cat
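If /var is nearly full, reclaiming space is usually faster than expanding the disk. A cautious cleanup sketch, assuming crictl is installed (as it usually is on containerd-based ACK nodes); review what will be removed before running this in production:
# Delete container images not used by any running pod
ssh <node-ip> crictl rmi --prune
# Cap the systemd journal at a fixed size
ssh <node-ip> journalctl --vacuum-size=200M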
Step 4: Cordon, Drain, and Reboot Node (Last Resort)
If the node is unrecoverable, safely evict workloads and perform a hard reboot. This is the nuclear option for persistent issues.
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
ssh <node-ip> reboot
# Wait for reboot, then uncordon
kubectl uncordon <node-name>
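Once the node is uncordoned, verify that workloads actually land on it again:
# List everything scheduled on the recovered node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> -o wide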
"In Alibaba Cloud ACK, check the 'Node Repair' feature in the console first. It can automatically diagnose and fix common underlying ECS issues like disk full or network misconfiguration."
Frequently Asked Questions
How long should I wait after restarting kubelet before declaring the fix failed?
Allow 2-3 minutes. The kubelet needs time to restart containers, re-register with the API server, and pass its periodic health check. If status doesn't change to Ready, proceed to resource checks and drain.
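To script that wait instead of eyeballing it, kubectl can block until the condition flips or a timeout expires:
kubectl wait --for=condition=Ready node/<node-name> --timeout=180s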
Will draining a node cause application downtime?
If your deployments have multiple replicas and use a PodDisruptionBudget, draining is graceful and typically causes zero downtime. Always check PDBs (`kubectl get pdb --all-namespaces`) before draining a production node.
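If a workload has no PDB yet, a minimal one looks like this (assumes a hypothetical Deployment labeled app=web; adjust the selector and minAvailable to your replica count):
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: web
EOF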
What's the most common root cause for Node NotReady in Alibaba Cloud?
Disk pressure on the system disk (especially /var/lib/containerd). Alibaba Cloud ECS instances often have a small 40GB system disk by default. Monitor and expand it preemptively.
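A quick way to spot disk trouble across the whole cluster before nodes go NotReady:
# Print each node alongside its DiskPressure condition status
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'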