CRITICAL

Kubernetes Pod: Fix EphemeralStorage Eviction during High Traffic Scaling

Quick Fix Summary

TL;DR

Delete evicted pods, free disk space on the affected node, and scale down the offending deployment to relieve ephemeral-storage pressure, then add ephemeral-storage requests/limits and log rotation so it doesn't recur.

The kubelet evicts pods when the node's ephemeral storage usage (emptyDir volumes, container logs, writable container layers) crosses its eviction threshold and the node reports DiskPressure; rapid scaling under high traffic is a common trigger.
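
A quick way to surface recent evictions cluster-wide (assuming default event reasons; note that events expire after a short TTL):

bash
kubectl get events --all-namespaces --field-selector reason=Evicted --sort-by=.lastTimestamp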

Diagnosis & Causes

  • Unbounded log growth from application containers during high traffic (a quick node-level check is sketched after this list).
  • Large emptyDir volumes or cache directories not being cleaned up.
  • Node disk space overallocation with insufficient buffer for scaling events.
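
A spot-check on the node itself usually confirms which of these is in play; a minimal sketch, assuming a standard kubelet layout:

bash
# Per-pod volume usage (emptyDir data lives here) and per-pod container log usage
sudo du -sh /var/lib/kubelet/pods/* 2>/dev/null | sort -rh | head -5
sudo du -sh /var/log/pods/* 2>/dev/null | sort -rh | head -5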

Recovery Steps

    Step 1: Verify and Diagnose the Eviction

    Identify the affected node, confirm ephemeral storage pressure, and list evicted pods.

    bash
    # Check node conditions and ephemeral-storage capacity (look for DiskPressure)
    kubectl describe node <node-name> | grep -iE 'DiskPressure|ephemeral-storage'
    # List failed pods cluster-wide (evicted pods land in the Failed phase)
    kubectl get pods --all-namespaces --field-selector=status.phase=Failed -o wide
    # Confirm the eviction reason and message in the pod status
    kubectl describe pod <evicted-pod-name> -n <namespace> | grep -i -A 2 message
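
    If you prefer a scripted check, the node's DiskPressure condition can be read directly; a minimal sketch using kubectl's JSONPath support:

    bash
    kubectl get node <node-name> -o jsonpath='{range .status.conditions[?(@.type=="DiskPressure")]}{.status}{": "}{.message}{"\n"}{end}'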

    Step 2: Immediate Cleanup of Evicted Pods

    Remove failed pods to free up their held ephemeral storage resources.

    bash
    # Delete all Failed pods (including Evicted ones) across namespaces
    kubectl get pods --all-namespaces --field-selector=status.phase=Failed \
      -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' \
      | while read ns pod; do kubectl delete pod "$pod" -n "$ns" --now; done
    # Or for a specific namespace:
    kubectl delete pods --field-selector=status.phase=Failed -n <namespace>
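
    To remove only pods that were actually evicted (and keep other Failed pods around for inspection), a sketch assuming jq is installed:

    bash
    kubectl get pods --all-namespaces -o json \
      | jq -r '.items[] | select(.status.reason=="Evicted") | "\(.metadata.namespace) \(.metadata.name)"' \
      | while read ns pod; do kubectl delete pod "$pod" -n "$ns"; done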

    Step 3: Free Disk Space on the Affected Node

    SSH into the node and clean up the usual ephemeral-storage consumers: container logs and unused container images.

    bash
    # Check disk usage on the kubelet's root filesystem
    df -h /var/lib/kubelet
    # Truncate oversized container logs rather than deleting them; the runtime keeps the
    # files open, so truncation frees space immediately (size threshold is adjustable)
    sudo find /var/log/pods -name "*.log" -size +100M -exec truncate -s 0 {} \;
    # Remove unused images (Docker runtime)
    sudo docker image prune -a -f
    # Remove unused images (containerd runtime, via crictl)
    sudo crictl rmi --prune
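
    If space is still tight, two further cleanups are usually safe; this sketch assumes a systemd-based node image and containerd with crictl installed, and the size threshold is illustrative:

    bash
    # Cap systemd journal logs
    sudo journalctl --vacuum-size=200M
    # Remove exited containers so their writable layers can be reclaimed
    sudo crictl ps -a --state exited -q | xargs -r sudo crictl rm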

    Step 4: Scale Down the Offending Workload

    Reduce replica count to immediately lower pressure, allowing the node to recover.

    bash
    kubectl scale deployment <deployment-name> -n <namespace> --replicas=<reduced-number>
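
    If a HorizontalPodAutoscaler manages the deployment, a manual scale-down will be reverted on its next sync; lower the autoscaler's ceiling instead (the HPA name and replica count below are illustrative):

    bash
    kubectl patch hpa <hpa-name> -n <namespace> --type merge -p '{"spec":{"maxReplicas":5}}'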

    Step 5: Configure Pod Ephemeral Storage Limits and Requests

    Add ephemeral-storage requests and limits to pod specs to give the scheduler better visibility and enforce boundaries.

    yaml
    # Example container spec addition:
    resources:
      requests:
        ephemeral-storage: "1Gi"
      limits:
        ephemeral-storage: "2Gi"

    Step 6: Implement Log Rotation and Size Limits

    Configure your container runtime (Docker/containerd) and application logging to prevent unbounded log growth.

    For Docker, configure log rotation in /etc/docker/daemon.json:

    json
    {"log-driver": "json-file", "log-opts": {"max-size": "10m", "max-file": "3"}}

    For a Pod that writes to an emptyDir, cap the volume with sizeLimit:

    yaml
    volumes:
    - name: log-volume
      emptyDir:
        sizeLimit: 500Mi
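
    If the node runs containerd or CRI-O, the kubelet rotates container logs itself; the relevant KubeletConfiguration fields are containerLogMaxSize (default 10Mi) and containerLogMaxFiles (default 5). A quick check on the node:

    bash
    grep -E 'containerLogMax(Size|Files)' /var/lib/kubelet/config.yaml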

    Step 7: Adjust Kubelet Eviction Thresholds

    Lower the kubelet's hard eviction threshold for node filesystem space (the default for nodefs.available is 10%) so workloads can use more of the disk before eviction kicks in, but only do this with adequate headroom and monitoring.

    yaml
    # Add to the kubelet configuration (e.g., /var/lib/kubelet/config.yaml).
    # Note: setting evictionHard replaces the defaults for any signals you leave out,
    # so list every threshold you rely on (memory.available, imagefs.available, ...).
    evictionHard:
      nodefs.available: "5%"

    Then restart the kubelet:

    bash
    sudo systemctl restart kubelet
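
    To confirm what the running kubelet actually picked up, the node proxy exposes its live configuration (jq assumed for readability):

    bash
    kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | jq '.kubeletconfig.evictionHard'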

    Step 8: Monitor and Alert on Ephemeral Storage

    Set up Prometheus/Grafana alerts for node ephemeral storage usage to catch issues before evictions.

    promql
    # Alert when the node filesystem backing ephemeral storage is over 85% used
    # (node_exporter metrics; adjust mountpoint if /var/lib/kubelet is a separate filesystem)
    100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100) > 85
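
    Without Prometheus, the kubelet summary API exposes the same node filesystem numbers directly (values are in bytes; jq assumed):

    bash
    kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | jq '.node.fs'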

    Architect's Pro Tip

    "This often happens when applications write debug/trace logs to stdout/stderr without rotation during traffic spikes. The default container log driver stores these in /var/log/pods, consuming ephemeral storage. Implement application-level log throttling and use sidecar containers for log shipping instead of local storage."

    Frequently Asked Questions

    Will deleting evicted pods cause data loss?

    Ephemeral storage (emptyDir) data is lost when a pod is deleted. For persistent data, use PersistentVolumeClaims (PVCs). Evicted pods are already terminated, so deleting them only removes their metadata from the API server.

    How do I find which pod/container is using the most ephemeral storage?

    SSH into the node and run `sudo du -sh /var/lib/kubelet/pods/*` to see per-pod usage. Drill down into `volumes/` and `containers/` subdirectories to identify the culprit.
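
    The directories under /var/lib/kubelet/pods are named by pod UID; a sketch for mapping a UID back to a pod name (jq assumed):

    bash
    kubectl get pods --all-namespaces -o json \
      | jq -r '.items[] | "\(.metadata.uid) \(.metadata.namespace)/\(.metadata.name)"' | grep <pod-uid>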
