
Solved: Azure Kubernetes Service (AKS) Pod CrashLoopBackOff in Kubernetes 1.31

Quick Fix Summary

TL;DR

Check pod logs, verify resource limits, and ensure application health checks are correctly configured.

CrashLoopBackOff indicates a pod is repeatedly crashing and restarting, failing to reach a stable running state. In Kubernetes 1.31, this often stems from application errors, misconfigured probes, or insufficient resources.
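
Before digging into a single pod, it helps to see how widespread the problem is. The sweep below is a generic sketch (no workload-specific names assumed) that lists every pod currently in the CrashLoopBackOff state and the heaviest restarters.

bash
# List every pod currently stuck in CrashLoopBackOff across all namespaces
kubectl get pods --all-namespaces | grep CrashLoopBackOff
# Sort pods by restart count to spot the worst offenders
kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount'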

Diagnosis & Causes

  • Application startup error or crash.
  • Misconfigured liveness or readiness probe.
  • Insufficient memory (OOMKilled) or CPU.
  • Missing dependencies or configuration.
  • Incorrect image tag or permissions.
Recovery Steps


Step 1: Immediate Diagnostics - Inspect Pod State & Logs

First, gather the pod's status and the logs from its most recent crash to identify the immediate failure.

bash
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
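
If the previous logs are empty or truncated, the pod's event stream often records probe failures, OOM kills, and image or scheduling problems. This supplementary check uses only standard kubectl flags; the placeholders match the commands above.

bash
# Show events for the failing pod, oldest to newest
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by=.lastTimestamp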

Step 2: Analyze Container Exit Codes & Resource Limits

Check whether the container was terminated by the system (OOMKilled, exit code 137) and review its requested and allocated resources.

bash
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 -B 5 "Last State\|Terminated\|OOM\|Limits"
kubectl top pod <pod-name> -n <namespace> --containers
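
If the last state shows OOMKilled (exit code 137), the usual fix is to raise the memory limit or shrink the application's footprint. The snippet below is a minimal sketch of a container resources block; the container name "app" and the values are illustrative and should be sized from the kubectl top output.

yaml
# Illustrative values only - size requests and limits from observed usage
spec:
  containers:
  - name: app
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi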

Step 3: Validate & Debug Liveness/Readiness Probes

Incorrect probes are a common cause. Test the probe endpoints manually and adjust timeouts/periods.

bash
# Test the liveness probe endpoint from within the cluster
kubectl exec -it <a-healthy-pod> -n <namespace> -- curl -v http://<pod-ip>:<port>/healthz

yaml
# Example probe adjustment in the deployment YAML
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
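
For applications that boot slowly (common for JVM and .NET services), a startupProbe keeps the liveness probe from killing the container before initialization finishes. A sketch assuming the same /healthz endpoint and port as above:

yaml
# Allows up to 30 x 10 = 300 seconds of startup time before liveness checks take over
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10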

Step 4: Debug Application Startup Locally & Check Dependencies

Simulate the container's entry point locally to catch early runtime errors like missing files or env vars.

bash
# Get the pod's image and command
kubectl describe pod <pod-name> -n <namespace> | grep -i "image\|command\|args"
# Run a debug container with a shell to inspect the filesystem
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name> -- sh
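
Missing ConfigMaps, Secrets, or environment variables frequently crash a container before it writes its first log line. The sketch below shows one way to confirm those references exist; the deployment, ConfigMap, and Secret names are placeholders.

bash
# List the environment variables the deployment injects
kubectl set env deployment/<deployment-name> -n <namespace> --list
# Confirm that referenced ConfigMaps and Secrets actually exist
kubectl get configmap <configmap-name> -n <namespace>
kubectl get secret <secret-name> -n <namespace>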

Step 5: Apply Fix & Perform a Controlled Rollout

After identifying the root cause, update your deployment. Use a strategic rollout to prevent widespread impact.

bash
# Apply the fix to your deployment manifest
kubectl apply -f deployment-fix.yaml
# Monitor the new rollout closely
kubectl rollout status deployment/<deployment-name> -n <namespace>
# Roll back immediately if the new pods crash
kubectl rollout undo deployment/<deployment-name> -n <namespace>
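
To keep a bad fix from replacing every replica at once, you can constrain the rolling update in the deployment spec. This is a minimal sketch; the maxSurge and maxUnavailable values are illustrative and should match your capacity headroom.

yaml
# Replace pods one at a time and never drop below full capacity
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0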

Architect's Pro Tip

"For .NET or Java apps on AKS, check the exit code before changing anything: 137 means the container was OOMKilled and needs a higher memory limit or a smaller runtime heap, while 139 (SIGSEGV) is a segmentation fault in native code that extra memory rarely fixes. Docker's `--oom-kill-disable` is not available on AKS's containerd-based nodes, so always set correct memory limits instead."

Frequently Asked Questions

What's the difference between CrashLoopBackOff and ImagePullBackOff?

CrashLoopBackOff means the container started but then crashed. ImagePullBackOff means Kubernetes cannot retrieve the container image from the registry (e.g., auth error, wrong tag).

How do I prevent CrashLoopBackOff during AKS upgrades to 1.31?

Test your application's compatibility in a non-production cluster first. Pay special attention to any deprecated APIs or changes in security context that may affect your pods.
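
One way to spot lingering deprecated API usage before upgrading is the API server's built-in metric, assuming your account is allowed to read the raw metrics endpoint:

bash
# Flag deprecated APIs that clients are still calling
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis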

My pod logs show nothing before the crash. What next?

The application may be crashing before it can write to stdout. Use `kubectl debug` to attach an ephemeral container, or set the container's `terminationMessagePolicy` to `FallbackToLogsOnError` so the last log lines are captured as the pod's termination message.
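
A sketch of that container-level setting; the container name and image are placeholders.

yaml
# With FallbackToLogsOnError, the tail of the container log becomes the termination message
containers:
- name: app
  image: <your-image>
  terminationMessagePolicy: FallbackToLogsOnError

After the next crash, the captured message appears under Last State in the kubectl describe pod output.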
