How to Fix Alibaba Cloud Node NotReady

Quick Fix Summary

TL;DR

Restart the kubelet service and check node resource constraints.

A Kubernetes node is shown as NotReady when the kubelet reports an unhealthy status or stops reporting to the control plane altogether. New pods will not be scheduled onto the node, and pods already running there may be evicted if the condition persists.
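To see which nodes are affected, the STATUS column of `kubectl get nodes` can be filtered. The following is a minimal sketch; the function name `not_ready_nodes` is only an illustration and is not part of kubectl.

```shell
# Print the names of nodes whose STATUS does not start with "Ready".
# Input (stdin): output of `kubectl get nodes --no-headers`.
# "Ready,SchedulingDisabled" (a cordoned but healthy node) is treated as Ready.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers | not_ready_nodes
```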

Diagnosis & Causes

Kubelet service crash or failure.

Insufficient system resources (memory/disk).

Network connectivity issues to API server.

Docker or containerd runtime failure.

Node OS kernel panic or hardware failure.

Recovery Steps

Step 1: Diagnose Node Status & Kubelet Health

First, get the detailed node condition and check if the kubelet process is running. This identifies if the issue is local to the node.

bash

kubectl describe node <node-name> | grep -A 10 Conditions
ssh <node-ip> systemctl status kubelet --no-pager
ssh <node-ip> 'journalctl -u kubelet --since "5 minutes ago" --no-pager'
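The Conditions table can also be reduced to just the entries that look wrong. This is a sketch under the assumption that conditions are fed in as tab-separated lines; the helper name `unhealthy_conditions` is hypothetical.

```shell
# Print node conditions that are in an unhealthy state.
# Input (stdin): one condition per line as "<type>\t<status>\t<reason>".
unhealthy_conditions() {
  awk -F'\t' '
    NF == 0                        { next }          # skip blank lines
    $1 == "Ready" && $2 != "True"  { print; next }   # Ready should be True
    $1 != "Ready" && $2 != "False" { print }         # pressure conditions should be False
  '
}

# Usage against a live cluster (NODE is a placeholder for the node name):
#   kubectl get node "$NODE" \
#     -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}' \
#     | unhealthy_conditions
```

An empty result suggests the node's reported conditions are healthy and the problem lies elsewhere, for example in connectivity to the API server.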

Step 2: Restart Kubelet and Container Runtime

If the kubelet is frozen or the runtime is hung, a restart is the fastest path to recovery. Always check logs after restart.

bash

ssh <node-ip> systemctl restart containerd
ssh <node-ip> systemctl restart kubelet
sleep 30 && kubectl get node <node-name>
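A fixed sleep can be replaced by polling until the node reports Ready. The sketch below assumes a POSIX shell; `wait_until` and `node_is_ready` are hypothetical helper names, and the timeout and interval values in the usage line are assumptions rather than recommendations from this guide.

```shell
# Poll a command until it succeeds or the timeout (in seconds) expires.
# Returns 0 on success, 1 on timeout.
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_deadline=$(( $(date +%s) + wu_timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$wu_deadline" ]; then
      return 1
    fi
    sleep "$wu_interval"
  done
}

# Succeeds when the node's Ready condition has status "True".
node_is_ready() {
  [ "$(kubectl get node "$1" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" = "True" ]
}

# Usage (NODE is a placeholder): wait up to 3 minutes, checking every 10 seconds.
#   wait_until 180 10 node_is_ready "$NODE" || echo "node still NotReady"
```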

Step 3: Check for Critical Resource Pressure

Disk pressure (especially under /var) and memory exhaustion are common silent killers. Use a shell on the node to inspect both.

bash

ssh <node-ip> df -h /var/lib/containerd /var/lib/kubelet
ssh <node-ip> free -h
# cgroup v1 with the cgroupfs driver; the path differs under the systemd driver or cgroup v2
ssh <node-ip> cat /sys/fs/cgroup/memory/kubepods/memory.usage_in_bytes
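To flag full filesystems without reading the table by eye, the `df` output can be filtered by a usage threshold. This is a sketch; `disks_over` is a hypothetical helper, and the 85% figure in the usage line is an assumed warning level, not a kubelet setting.

```shell
# Print mount points whose usage is at or above a threshold (percent).
# Input (stdin): output of `df -P` (POSIX format, one filesystem per line).
# Mount points containing spaces are not handled.
disks_over() {
  awk -v limit="$1" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)                 # strip the trailing percent sign
      if (use + 0 >= limit) print $6, use "%"
    }
  '
}

# Usage (run on the node, or through ssh):
#   df -P /var/lib/containerd /var/lib/kubelet | disks_over 85
```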

Step 4: Cordon, Drain, and Reboot Node (Last Resort)

If the node is unrecoverable, safely evict its workloads and reboot it. This is the nuclear option for persistent issues.

bash

kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
ssh <node-ip> reboot
# Wait for reboot, then uncordon
kubectl uncordon <node-name>
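On a NotReady node the kubelet may be unable to confirm that pods have stopped, so a drain can wait indefinitely. One way to bound it is the `--timeout` flag of `kubectl drain`. The wrapper below is a sketch; `drain_node` and the `DRAIN_TIMEOUT` variable are hypothetical names, and the 300s default is an assumption.

```shell
# Cordon and drain a node, giving up after a bounded time.
# DRAIN_TIMEOUT is a hypothetical tunable (default: 300s).
drain_node() {
  dn_node=$1
  kubectl cordon "$dn_node" || return 1
  kubectl drain "$dn_node" --ignore-daemonsets --delete-emptydir-data \
    --timeout="${DRAIN_TIMEOUT:-300s}"
}

# Usage (NODE is a placeholder):
#   drain_node "$NODE" || echo "drain did not finish; inspect stuck pods before rebooting"
```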

Architect's Pro Tip

"In Alibaba Cloud ACK, check the 'Node Repair' feature in the console first. It can automatically diagnose and fix common underlying ECS issues like disk full or network misconfiguration."

Frequently Asked Questions

How long should I wait after restarting kubelet before declaring the fix failed?

Allow 2-3 minutes. The kubelet needs time to restart containers, re-register with the API server, and pass its periodic health check. If status doesn't change to Ready, proceed to resource checks and drain.

Will draining a node cause application downtime?

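If your deployments have multiple replicas and a PodDisruptionBudget, the drain respects the budget and evicts pods gradually, which normally avoids downtime. Single-replica workloads will be interrupted while their pod is rescheduled. Always check PDBs (`kubectl get pdb --all-namespaces`) before draining a production node.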
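A PDB that currently allows zero disruptions will block the drain for the pods it covers. The sketch below lists such budgets from tab-separated input; `blocking_pdbs` is a hypothetical helper name.

```shell
# Print PodDisruptionBudgets that currently allow zero disruptions.
# Input (stdin): "<namespace>\t<name>\t<disruptionsAllowed>" per line.
blocking_pdbs() {
  awk -F'\t' 'NF >= 3 && $3 == 0 { print $1 "/" $2 }'
}

# Usage against a live cluster:
#   kubectl get pdb --all-namespaces \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.disruptionsAllowed}{"\n"}{end}' \
#     | blocking_pdbs
```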

What's the most common root cause for Node NotReady in Alibaba Cloud?

Disk pressure on the system disk (especially /var/lib/containerd). Alibaba Cloud ECS instances often have a small 40 GB system disk by default, which container images and logs can fill quickly. Monitor usage and expand the disk preemptively.

Related Alibaba Cloud Guides

How to Fix Alibaba Cloud InvalidAccessKeyId.NotFound

How to Fix Alibaba Cloud Throttling.User API Limit

Root Cause Analysis: Why Alibaba Cloud ACK Pods Get OOMKilled