Kubernetes Troubleshooting Guide: Diagnosing Pod FailedScheduling Errors

Quick Fix Summary

TL;DR

Check node resource availability and taint/toleration mismatches using `kubectl describe pod` and `kubectl get nodes`.

FailedScheduling occurs when the Kubernetes scheduler cannot find a suitable node to place a Pod. This is a pre-runtime error that prevents the Pod from starting.
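
You can confirm that a Pod is stuck at this stage from its status: the phase stays Pending and the PodScheduled condition reports False with reason Unschedulable. A quick check (the Pod and namespace names are placeholders):

bash
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.phase}{"  "}{.status.conditions[?(@.type=="PodScheduled")].reason}{"\n"}'
# Typical output while unscheduled:  Pending  Unschedulable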

Diagnosis & Causes

  • Insufficient CPU or Memory resources on nodes.
  • NodeSelector or NodeAffinity rules not matching any node.
  • Taint on nodes without corresponding Pod toleration.
  • No nodes are in a Ready state to accept workloads.
  • Resource requests exceed available node capacity.
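
Each of these causes shows up verbatim in the scheduler's FailedScheduling event message. A quick sketch using the events reason field selector lists every such event across the cluster in one query:

bash
kubectl get events -A --field-selector reason=FailedScheduling \
  --sort-by=.metadata.creationTimestamp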

Recovery Steps

Step 1: Inspect the Pod Event Log

The `kubectl describe pod` command reveals the scheduler's specific reason for failure in the Events section.

bash
kubectl describe pod <pod-name> -n <namespace>
# Look for lines like:
# Events:
#   Type     Reason            Age   From               Message
#   Warning  FailedScheduling  10s   default-scheduler  0/3 nodes are available: 1 Insufficient cpu, 2 node(s) didn't match Pod's node affinity.
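
The same information is available from the Events API, which is easier to filter and script against. A minimal sketch:

bash
# Only the events that reference this Pod, oldest first
kubectl get events -n <namespace> \
  --field-selector involvedObject.name=<pod-name> \
  --sort-by=.metadata.creationTimestamp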

Step 2: Check Node Resource Availability

Compare the Pod's resource requests against each node's allocatable capacity. The scheduler only looks at requests (not limits or live usage), and it subtracts the requests of Pods already running on the node.

bash
kubectl get nodes
kubectl describe node <node-name>
# In the output, check:
# Allocatable:
#   cpu:                940m
#   memory:             5442344Ki
# Compare this to your Pod's `spec.containers[].resources.requests`.
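
A quick way to see both sides of that comparison (node and Pod names are placeholders; `kubectl top` requires metrics-server and shows live usage, which is context rather than what the scheduler checks):

bash
# Requests already placed on the node (the scheduler subtracts these from Allocatable)
kubectl describe node <node-name> | grep -A 8 'Allocated resources'

# The Pod's own requests, straight from its spec
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources.requests}{"\n"}{end}'

# Live usage for context (requires metrics-server)
kubectl top nodes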

Step 3: Verify Node Selectors, Affinity, and Taints

Ensure your Pod's placement constraints (affinity/selectors) are compatible with node labels and taints.

bash
# Check Pod's placement rules
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 -B 5 'nodeSelector\|affinity\|tolerations'
# Check a Node's labels and taints
kubectl describe node <node-name> | grep -A 10 -B 5 'Labels\|Taints:'
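
Once you have found the mismatch, the fix is usually on one side or the other: give a node the label the Pod selects on, add a matching toleration to the Pod spec, or remove a taint that is no longer needed. A sketch with hypothetical label and taint keys:

bash
# Add the label the Pod's nodeSelector/affinity expects (key and value are examples)
kubectl label nodes <node-name> disktype=ssd

# Remove an unwanted taint (the trailing '-' deletes it)
kubectl taint nodes <node-name> dedicated=batch:NoSchedule-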

Step 4: Check Node Status and Conditions

A node must report a Ready condition of True to accept new Pods. Conditions such as MemoryPressure, DiskPressure, PIDPressure, or NetworkUnavailable being True add corresponding taints that also block scheduling.

bash
kubectl get nodes
kubectl describe node <node-name> | grep -A 10 'Conditions:'
# Look for:
# Conditions:
#   Type             Status  LastHeartbeatTime
#   Ready            True    ... (GOOD)
#   MemoryPressure   False   ... (GOOD)
#   DiskPressure     False   ... (GOOD)
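
To read a single condition without scrolling through the full describe output, a jsonpath filter works; it is also worth remembering that a cordoned node (shown as SchedulingDisabled by `kubectl get nodes`) rejects new Pods even when it is Ready. A minimal sketch:

bash
# Just the Ready condition's status for one node
kubectl get node <node-name> \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"\n"}'

# Re-enable scheduling on a cordoned node once it is healthy
kubectl uncordon <node-name>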

Step 5: Diagnose with Scheduler Logs (Advanced)

For complex cases, raise the scheduler's log verbosity to see its filtering and scoring decisions. On managed control planes (EKS, GKE, AKS) the scheduler is not directly accessible, so this step applies to self-managed clusters.

bash
# In kubeadm-based clusters the scheduler runs as a static Pod, not a Deployment.
# On the control plane node, add '- --v=4' to the container command args in
# /etc/kubernetes/manifests/kube-scheduler.yaml; the kubelet restarts the Pod
# automatically when the manifest changes. Then follow the logs:
kubectl logs -n kube-system -l component=kube-scheduler -f | grep -i <pod-name>
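
If a Pending Pod shows no FailedScheduling events at all, it is worth confirming that a scheduler is actually running, since a stopped scheduler leaves Pods Pending without emitting any events. A quick check (assumes the component=kube-scheduler label that kubeadm applies):

bash
kubectl get pods -n kube-system -l component=kube-scheduler
# Expect one Running kube-scheduler Pod per control plane node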

Step 6: Simulate Scheduling with `kubectl describe`

Use the `kubectl describe` output to manually verify if any node meets the Pod's requirements.

bash
# From the 'FailedScheduling' event message, note the reasons (e.g., 'Insufficient memory', 'node(s) didn't match node selector').
# Cross-reference:
# 1. For 'Insufficient cpu/memory': Check 'Allocated resources' in `kubectl describe node <node-name>`.
# 2. For selector/affinity: Run `kubectl get nodes --show-labels`.
# 3. For taints: Run `kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints`.
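
For the selector/affinity case, the cross-reference can be done in one step: feed the Pod's own nodeSelector back into a label-filtered node query; an empty result means no node can host the Pod. A sketch (the label key and value are examples):

bash
# Show the Pod's nodeSelector (a key/value map)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeSelector}{"\n"}'

# List the nodes that actually carry that label
kubectl get nodes -l disktype=ssd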

Architect's Pro Tip

"Use `kubectl get pods --field-selector=status.phase=Pending -A` to quickly find all unscheduled Pods across namespaces before diving into individual descriptions."

Frequently Asked Questions

What's the difference between FailedScheduling and ImagePullBackOff?

FailedScheduling happens BEFORE the Pod is assigned to a node (scheduling phase). ImagePullBackOff happens AFTER scheduling, when the node cannot pull the container image (runtime phase).
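
The distinction is visible directly in `kubectl get pods -o wide`: an unscheduled Pod has no node assigned, while an ImagePullBackOff Pod already does. Illustrative output (the Pod and node names are made up):

bash
kubectl get pods -n <namespace> -o wide
# NAME    READY   STATUS             ...  NODE
# web-1   0/1     Pending            ...  <none>     <- never scheduled (FailedScheduling)
# web-2   0/1     ImagePullBackOff   ...  worker-2   <- scheduled, image pull is failing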

Can a Pod be stuck in Pending for reasons other than FailedScheduling?

Yes. A Pending Pod might be waiting for a PersistentVolumeClaim to be bound, or requesting an extended resource (a GPU, for example) whose device plugin has not registered on any node yet. The Pod's events and `kubectl get pvc` will show which case applies.
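
For the volume case, checking the claim itself usually settles it (the namespace and claim name are placeholders):

bash
# A claim stuck in 'Pending' here is the blocker
kubectl get pvc -n <namespace>
kubectl describe pvc <claim-name> -n <namespace> | grep -A 5 'Events:'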

How do I fix '0/3 nodes are available: 3 node(s) had taint {node.kubernetes.io/not-ready}'?

This taint is added automatically by the node lifecycle controller when a node's Ready condition is not True. Fix the underlying node issue (kubelet, network, disk); the taint is removed once the node reports Ready again. As a temporary workaround you can add a toleration for this taint to your Pod spec, but this is not recommended for production.
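
Fixing the node usually starts at the kubelet. A sketch, assuming SSH access to a systemd-based node:

bash
# Identify the NotReady nodes
kubectl get nodes

# On the affected node (via SSH):
sudo systemctl status kubelet        # is the kubelet running?
sudo journalctl -u kubelet -n 100    # recent kubelet errors (certificates, CNI, disk)
# When the node reports Ready again, the not-ready taint is removed automatically.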
