Troubleshooting Guide: Diagnosing Linux Kernel Panic on Production Servers

Quick Fix Summary

Boot from a known-good kernel, check system logs, and analyze the panic message for hardware or driver failures.

A kernel panic is an unrecoverable system-level error where the Linux kernel halts to prevent data corruption. It's triggered by critical failures in core kernel code, hardware, or drivers.
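
On a production box you may also want the kernel to reboot automatically a few seconds after a panic rather than hang at the console. A minimal sketch using the standard kernel.panic sysctl (the 10-second value and the sysctl.d file name are arbitrary examples):

bash
# Reboot 10 seconds after a panic instead of hanging (0 disables auto-reboot).
sudo sysctl -w kernel.panic=10
# Persist the setting across reboots (any file under /etc/sysctl.d/ works).
echo "kernel.panic = 10" | sudo tee /etc/sysctl.d/90-panic.conf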

Diagnosis & Causes

  • Faulty or incompatible hardware (RAM, CPU, storage).
  • Buggy or misconfigured kernel modules or drivers.
  • Kernel memory corruption or NULL pointer dereference.
  • File system corruption on critical partitions.
  • Overheating or severe power supply issues.

Recovery Steps

    Step 1: Secure Immediate Logs and System State

    If the system is partially responsive or you can access it via a serial console/KVM, capture the panic screen and any preceding messages. This is your primary evidence.

    bash
    # If in a text console, scroll back to capture the full panic output.
    # Use 'dmesg' to see recent kernel messages if the system is still up.
    dmesg -T | tail -100
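
    If the panic output never reaches disk, a serial console or the kernel's netconsole module can stream the final messages to another host. A minimal sketch, assuming eth0 on the crashing server and a hypothetical collector host; the IPs, ports, and MAC address are placeholders:

    bash
    # Load netconsole to send kernel messages over UDP to a remote collector.
    sudo modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.50/aa:bb:cc:dd:ee:ff
    # On the collector, listen for the messages (netcat syntax varies by variant).
    nc -u -l 6666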

    Step 2: Analyze Post-Crash Logs (Journalctl & /var/log)

    After a reboot, the system logs contain the most critical data. The kernel logs its final moments to the ring buffer, which journald or syslog may have captured.

    bash
    # Check the systemd journal for kernel messages from the boot that crashed.
    # Note: -k alone implies the current boot, so after a reboot add -b -1 for the previous one.
    sudo journalctl -k -b -1 --since "2 hours ago"
    # Also check traditional log files (on RHEL/CentOS the equivalent is /var/log/messages).
    sudo grep -i "panic\|oops\|BUG" /var/log/kern.log /var/log/syslog
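
    Reading the previous boot with -b -1 only works if the journal survives reboots; on systems where /var/log/journal does not exist, journald keeps logs in memory only. A minimal sketch to make it persistent (assuming systemd-journald with the default Storage=auto):

    bash
    # Create the persistent journal directory and restart journald.
    sudo mkdir -p /var/log/journal
    sudo systemctl restart systemd-journald
    # From now on, earlier boots can be listed and inspected after a crash.
    journalctl --list-boots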

    Step 3: Inspect the Kernel Core Dump (if configured)

    If kdump is enabled, a vmcore file is written (to /var/crash by default) when the kernel panics. Analyze it with the crash utility to get a stack trace and examine kernel state at the moment of the panic.

    bash
    # Install the crash utility and matching kernel debug symbols
    # (the -dbgsym / kernel-debuginfo packages come from debug-symbol repos, which may need enabling).
    sudo apt install crash linux-image-$(uname -r)-dbgsym # Debian/Ubuntu
    sudo yum install crash kernel-debuginfo # RHEL/CentOS
    # Analyze the vmcore.
    sudo crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore
    # Inside crash, run 'bt' (backtrace) and 'log'.
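
    If there is no vmcore, first confirm kdump was actually armed at the time of the crash. A quick check; service names differ by distribution (kdumpctl on the RHEL family, kdump-config from Debian/Ubuntu kdump-tools):

    bash
    # 1 means a crash (capture) kernel is loaded and a panic can be dumped.
    cat /sys/kernel/kexec_crash_loaded
    # Confirm the memory reservation for the crash kernel is on the cmdline.
    grep -o "crashkernel=[^ ]*" /proc/cmdline
    # Service status.
    sudo kdumpctl status        # RHEL/CentOS
    sudo kdump-config status    # Debian/Ubuntu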

    Step 4: Isolate Hardware vs. Software Cause

    Run hardware diagnostics and review recent software changes to pinpoint the fault domain.

    bash
    # Run a memory test (requires bootable media like Memtest86+).
    # Check disk health.
    sudo smartctl -a /dev/sdX
    # List recently updated/installed kernel modules and packages.
    rpm -qa --last | head -20 # RHEL
    grep "install " /var/log/dpkg.log | tail -20 # Debian/Ubuntu

    Step 5: Boot with a Stable Kernel and Minimal Configuration

    Boot from a previous, known-stable kernel in GRUB. If it boots, the issue is with the new kernel or its modules. Use kernel boot parameters to further isolate.

    bash
    # In GRUB, select an older kernel version.
    # To troubleshoot, add boot parameters in GRUB ('e' to edit).
    # Disable problematic hardware/drivers:
    linux /vmlinuz ... modprobe.blacklist=nouveau,intel_lpss_pci irqpoll
    # Use a minimal runlevel:
    linux /vmlinuz ... systemd.unit=rescue.target
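
    If the older kernel proves stable, pin it as the default so an unattended reboot does not land back on the broken one. A sketch; the entry titles, indexes, and kernel path are placeholders:

    bash
    # RHEL/CentOS: list boot entries, then set a known-good kernel as default.
    sudo grubby --info=ALL | grep -E "^index|^title"
    sudo grubby --set-default /boot/vmlinuz-<known-good-version>
    # Debian/Ubuntu: with GRUB_DEFAULT=saved in /etc/default/grub:
    sudo grub-set-default "<menu entry title>"
    sudo update-grub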

    Step 6: Reproduce in a Test Environment (If Possible)

    If the cause is suspected to be a kernel bug or specific driver, attempt to replicate the conditions (same kernel version, workload, hardware) on a non-production system.

    bash
    # Clone the kernel config from the production server
    # (if /proc/config.gz is absent, copy /boot/config-$(uname -r) instead).
    zcat /proc/config.gz > .config
    # Build and test the same kernel version in a VM or spare hardware.
    # Stress-test specific components (e.g., memory, disk I/O).
    stress-ng --vm 2 --vm-bytes 2G --timeout 60s
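
    Targeting the suspected subsystem usually reproduces the fault faster than general load; the flags below are illustrative values, not tuned numbers. Watch the kernel ring buffer in a second terminal while the test runs:

    bash
    # Hammer disk I/O or CPU specifically (adjust workers/sizes to the host).
    stress-ng --hdd 4 --hdd-bytes 1G --timeout 10m --metrics-brief
    stress-ng --cpu 0 --cpu-method matrixprod --timeout 10m
    # Follow kernel messages live in another terminal.
    sudo dmesg -wT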

    Architect's Pro Tip

    "Configure kdump *before* a panic occurs. Test it works by triggering a panic with 'echo c > /proc/sysrq-trigger' on a test system. A valid vmcore is worth 1000 log entries."

    Frequently Asked Questions

    What's the difference between a Kernel Panic and an Oops?

    An Oops is a non-fatal kernel error where the kernel can often continue running (though possibly unstable). A Panic is a deliberate, unrecoverable halt triggered by a critical Oops or other condition to prevent data corruption.
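
    Whether an Oops escalates to a panic is controlled by a sysctl; many production sites set it to 1 so every Oops produces a vmcore instead of a silently degraded kernel. A sketch:

    bash
    # 1 = panic (and dump, if kdump is armed) on every Oops; 0 = try to continue.
    cat /proc/sys/kernel/panic_on_oops
    sudo sysctl -w kernel.panic_on_oops=1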

    My server panics randomly once a month. How do I find the cause?

    Intermittent panics are often hardware-related (failing RAM, overheating, a failing PSU). Enable persistent, detailed logging (forward syslog to a remote server), configure kdump, and run extended hardware diagnostics (Memtest86+ for 24+ hours, CPU stress tests).
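
    Forwarding logs off-box preserves the final messages before a crash even if the local disk never receives them. A minimal rsyslog sketch; the file name and log host are placeholders (@@ selects TCP, a single @ would be UDP):

    bash
    # Forward all messages to a central log host over TCP.
    echo '*.* @@loghost.example.com:514' | sudo tee /etc/rsyslog.d/90-remote.conf
    sudo systemctl restart rsyslog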

    The panic message mentions 'Not syncing' or 'Kernel panic - not syncing'. What does this mean?

    This is the standard panic preamble. The key diagnostic information follows this line: the specific error type (e.g., 'Unable to handle kernel NULL pointer dereference'), the instruction pointer (RIP/EIP), and the call trace (stack backtrace). Focus your analysis there.
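
    With a persistent journal, the full panic block from the previous boot can usually be pulled out directly; adjust the context lines as needed:

    bash
    # Grab the panic line plus the call trace that follows it.
    sudo journalctl -k -b -1 | grep -i -B 5 -A 40 "kernel panic - not syncing"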
