How to Fix Alibaba Cloud Node NotReady

Quick Fix Summary

Restart the kubelet service and check node resource constraints.

A Kubernetes node is marked NotReady when its kubelet reports an unhealthy status or stops reporting to the control plane altogether. The scheduler places no new pods on the node, and pods already running there are eventually evicted if the condition persists.

Diagnosis & Causes

  • Kubelet service crash or failure.
  • Insufficient system resources (memory/disk).
  • Network connectivity issues to API server.
  • Docker or containerd runtime failure.
  • Node OS kernel panic or hardware failure.

Recovery Steps

    Step 1: Diagnose Node Status & Kubelet Health

    First, get the detailed node condition and check if the kubelet process is running. This identifies if the issue is local to the node.

    bash
    kubectl describe node <node-name> | grep -A 10 Conditions
    ssh <node-ip> systemctl status kubelet --no-pager
    ssh <node-ip> journalctl -u kubelet --since "5 minutes ago" --no-pager
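
The status of the Ready condition already narrows the search: `False` means the kubelet is still reporting but considers the node unhealthy, while `Unknown` means the control plane has stopped hearing from it. A minimal sketch of that triage follows; the function name `ready_hint` and its output labels are hypothetical, chosen only for illustration.

```bash
# Map the Ready condition's status to a first place to look.
# ready_hint and its output labels are illustrative, not part of kubectl.
ready_hint() {
  case "$1" in
    True)    echo "ready" ;;                        # node is healthy
    False)   echo "kubelet-reporting-unhealthy" ;;  # check runtime and resource pressure
    Unknown) echo "kubelet-not-reporting" ;;        # check kubelet process and network path
    *)       echo "unrecognized" ;;
  esac
}

# Usage (needs cluster access; NODE is a placeholder):
# status=$(kubectl get node "$NODE" \
#   -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
# ready_hint "$status"
```

The jsonpath filter in the usage comment reads only the Ready condition, which avoids depending on how many lines `kubectl describe` prints under Conditions.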

    Step 2: Restart Kubelet and Container Runtime

    If the kubelet is frozen or the runtime is hung, a restart is the fastest path to recovery. Always check logs after restart.

    bash
    ssh <node-ip> systemctl restart containerd
    ssh <node-ip> systemctl restart kubelet
    sleep 30 && kubectl get node <node-name>
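
A fixed sleep can be too short or needlessly long. One alternative is to poll until the node reports Ready or a timeout expires. This is a sketch under stated assumptions: `wait_for_ready` is a hypothetical helper, and the status command is passed in as arguments so the loop itself does not depend on cluster access.

```bash
# Poll a status command until it prints "True" or the timeout expires.
# Usage: wait_for_ready <timeout_seconds> <interval_seconds> <command...>
# Returns 0 once the command prints "True", 1 on timeout.
wait_for_ready() {
  local timeout="$1"
  local interval="$2"
  local elapsed=0
  shift 2
  while :; do
    if [ "$("$@" 2>/dev/null)" = "True" ]; then
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
}

# Usage (needs cluster access; NODE is a placeholder):
# wait_for_ready 180 10 kubectl get node "$NODE" \
#   -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```

The 180-second timeout in the usage comment is only an example value; pick one that matches how long your nodes normally take to re-register.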

    Step 3: Check for Critical Resource Pressure

    Disk pressure (especially under /var) and memory exhaustion are common causes that are easy to miss. Inspect both from a shell on the node.

    bash
    ssh <node-ip> df -h /var/lib/containerd /var/lib/kubelet
    ssh <node-ip> free -h
    # Memory pressure via PSI (requires a kernel with PSI enabled, 4.20+)
    ssh <node-ip> cat /proc/pressure/memory
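
To avoid eyeballing the `df` output, the usage column can be filtered against a threshold. The sketch below assumes POSIX-format output (`df -P`) and mount points without spaces; `disk_over_threshold` and the 85% figure are hypothetical choices, not kubelet settings. The kubelet's own eviction thresholds are configurable and typically trigger when available space on the node filesystem falls below about 10%.

```bash
# Read `df -P` output on stdin and print mount points whose usage is at or
# above the given percentage. Assumes mount points contain no spaces.
# Usage: df -P | disk_over_threshold 85
disk_over_threshold() {
  awk -v limit="$1" '
    NR > 1 {                 # skip the header row
      use = $5               # "Capacity" column, e.g. "90%"
      sub(/%/, "", use)
      if (use + 0 >= limit + 0) print $6 " " use "%"
    }'
}

# Usage against a node (placeholder address):
# ssh <node-ip> df -P /var/lib/containerd /var/lib/kubelet | disk_over_threshold 85
```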

    Step 4: Cordon, Drain, and Reboot Node (Last Resort)

    If the node is still NotReady after the previous steps, cordon it, evict its workloads, and reboot it. Treat this as the last resort for persistent issues.

    bash
    kubectl cordon <node-name>
    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
    ssh <node-ip> reboot
    # Wait for reboot, then uncordon
    kubectl uncordon <node-name>

    Architect's Pro Tip

    "In Alibaba Cloud ACK, check the 'Node Repair' feature in the console first. It can automatically diagnose and fix common underlying ECS issues like disk full or network misconfiguration."

    Frequently Asked Questions

    How long should I wait after restarting kubelet before declaring the fix failed?

    Allow 2-3 minutes. The kubelet needs time to restart containers, re-register with the API server, and pass its periodic health check. If status doesn't change to Ready, proceed to resource checks and drain.

    Will draining a node cause application downtime?

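    If your deployments run multiple replicas spread across nodes and are protected by a PodDisruptionBudget, a drain evicts pods gracefully and usually causes no downtime. Single-replica workloads will be briefly unavailable while they reschedule. Always check PDBs (`kubectl get pdb --all-namespaces`) before draining a production node, since a PDB that currently allows zero disruptions will block the drain.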
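
The PDB check can be narrowed to the budgets that would actually stall a drain, namely those whose status currently allows zero disruptions. A minimal sketch, assuming three whitespace-separated columns (namespace, name, allowed disruptions) on stdin; `blocking_pdbs` is a hypothetical helper name.

```bash
# Read "namespace name disruptionsAllowed" rows on stdin and print the
# PDBs that currently allow zero disruptions, as namespace/name.
blocking_pdbs() {
  awk '$3 == 0 { print $1 "/" $2 }'
}

# Usage (needs cluster access):
# kubectl get pdb --all-namespaces --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#   | blocking_pdbs
```

Any PDB listed here is worth resolving, for example by scaling up the workload, before running the drain.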

    What's the most common root cause for Node NotReady in Alibaba Cloud?

    A frequent cause is disk pressure on the system disk (especially /var/lib/containerd). Alibaba Cloud ECS instances often have a small 40 GB system disk by default. Monitor its usage and expand it before it fills up.
