
Solved: Azure Kubernetes Service (AKS) Pod CrashLoopBackOff in Kubernetes 1.31

Quick Fix Summary

TL;DR

Check pod logs, verify resource limits, and ensure application health checks are correctly configured.

CrashLoopBackOff indicates a pod is repeatedly crashing and restarting, failing to reach a stable running state. In Kubernetes 1.31, this often stems from application errors, misconfigured probes, or insufficient resources.
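
Before digging into a single pod, it helps to see how widespread the problem is. The sweep below is a generic sketch (no workload-specific names assumed) that lists every pod currently in the CrashLoopBackOff state and the heaviest restarters.

bash
# List every pod currently stuck in CrashLoopBackOff across all namespaces
kubectl get pods --all-namespaces | grep CrashLoopBackOff
# Sort pods by restart count to spot the worst offenders
kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount'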

Diagnosis & Causes

  • Application startup error or crash.
  • Misconfigured liveness or readiness probe.
  • Insufficient memory (OOMKilled) or CPU.
  • Missing dependencies or configuration.
  • Incorrect image tag or permissions.
Recovery Steps


Step 1: Immediate Diagnostics - Inspect Pod State & Logs

First, gather the pod's status and the logs from its most recent crash to identify the immediate failure.

bash
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
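
If the previous logs are empty or truncated, the pod's event stream often records probe failures, OOM kills, and image or scheduling problems. This supplementary check uses only standard kubectl flags; the placeholders match the commands above.

bash
# Show events for the failing pod, oldest to newest
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp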

Step 2: Analyze Container Exit Codes & Resource Limits

Check whether the container was terminated by the system (OOMKilled, exit code 137) and review its requested and allocated resources.

bash
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 -B 5 "Last State\|Terminated\|OOM\|Limits"
kubectl top pod <pod-name> -n <namespace> --containers
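
If the last state shows OOMKilled (exit code 137), the usual fix is to raise the memory limit or shrink the application's footprint. The snippet below is a minimal sketch of a container resources block; the container name "app" and the values are illustrative and should be sized from the kubectl top output.

yaml
# Illustrative values only - size requests and limits from observed usage
spec:
  containers:
  - name: app
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi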

Step 3: Validate & Debug Liveness/Readiness Probes

Incorrect probes are a common cause. Test the probe endpoints manually and adjust timeouts/periods.

bash
# Test the liveness probe endpoint from within the cluster
kubectl exec -it <a-healthy-pod> -n <namespace> -- curl -v http://<pod-ip>:<port>/healthz

yaml
# Example probe adjustment in the deployment YAML
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
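
For applications that boot slowly (common for JVM and .NET services), a startupProbe keeps the liveness probe from killing the container before initialization finishes. A sketch assuming the same /healthz endpoint and port as above:

yaml
# Allows up to 30 x 10 = 300 seconds of startup time before liveness checks take over
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10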

Step 4: Debug Application Startup Locally & Check Dependencies

Simulate the container's entry point locally to catch early runtime errors like missing files or env vars.

bash
# Get the pod's image and command
kubectl describe pod <pod-name> -n <namespace> | grep -i "image\|command\|args"
# Run a debug container with a shell to inspect the filesystem
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name> -- sh
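
Missing ConfigMaps, Secrets, or environment variables frequently crash a container before it writes its first log line. The sketch below shows one way to confirm those references exist; the deployment, ConfigMap, and Secret names are placeholders.

bash
# List the environment variables the deployment injects
kubectl set env deployment/<deployment-name> -n <namespace> --list
# Confirm that referenced ConfigMaps and Secrets actually exist
kubectl get configmap <configmap-name> -n <namespace>
kubectl get secret <secret-name> -n <namespace>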

Step 5: Apply Fix & Perform a Controlled Rollout

After identifying the root cause, update your deployment. Use a strategic rollout to prevent widespread impact.

bash
# Apply the fix to your deployment manifest
kubectl apply -f deployment-fix.yaml
# Monitor the new rollout closely
kubectl rollout status deployment/<deployment-name> -n <namespace>
# Roll back immediately if the new pods crash
kubectl rollout undo deployment/<deployment-name> -n <namespace>
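
To keep a bad fix from replacing every replica at once, you can constrain the rolling update in the deployment spec. This is a minimal sketch; the maxSurge and maxUnavailable values are illustrative and should match your capacity headroom.

yaml
# Replace pods one at a time and never drop below full capacity
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0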

Architect's Pro Tip

"For .NET or Java apps on AKS, check the exit code before changing anything: 137 means the container was OOMKilled and needs a higher memory limit or a smaller runtime heap, while 139 (SIGSEGV) is a segmentation fault in native code that extra memory rarely fixes. Docker's `--oom-kill-disable` is not available on AKS's containerd-based nodes, so always set correct memory limits instead."

Frequently Asked Questions

What's the difference between CrashLoopBackOff and ImagePullBackOff?

CrashLoopBackOff means the container started but then crashed. ImagePullBackOff means Kubernetes cannot retrieve the container image from the registry (e.g., auth error, wrong tag).

How do I prevent CrashLoopBackOff during AKS upgrades to 1.31?

Test your application's compatibility in a non-production cluster first. Pay special attention to any deprecated APIs or changes in security context that may affect your pods.
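
One way to spot lingering deprecated API usage before upgrading is the API server's built-in metric, assuming your account is allowed to read the raw metrics endpoint:

bash
# Flag deprecated APIs that clients are still calling
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis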

My pod logs show nothing before the crash. What next?

The application may be crashing before it can write to stdout. Use `kubectl debug` to attach an ephemeral container, or set the container's `terminationMessagePolicy` to `FallbackToLogsOnError` so the last log lines are captured as the pod's termination message.
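
A sketch of that container-level setting; the container name and image are placeholders.

yaml
# With FallbackToLogsOnError, the tail of the container log becomes the termination message
containers:
- name: app
  image: <your-image>
  terminationMessagePolicy: FallbackToLogsOnError

After the next crash, the captured message appears under Last State in the kubectl describe pod output.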
