Troubleshooting Guide: Diagnosing Linux Kernel Panic on Production Servers
Quick Fix Summary
TL;DR: Boot from a known-good kernel, check system logs, and analyze the panic message for hardware or driver failures.
A kernel panic is an unrecoverable system-level error where the Linux kernel halts to prevent data corruption. It's triggered by critical failures in core kernel code, hardware, or drivers.
Diagnosis & Causes
Recovery Steps
Step 1: Secure Immediate Logs and System State
If the system is partially responsive or you can access it via a serial console/KVM, capture the panic screen and any preceding messages. This is your primary evidence.
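A small helper can make this capture repeatable. The sketch below saves the output of any command to a timestamped file; the function name `snapshot` and the directory used in the usage note are illustrative assumptions, not part of any standard tool.

```shell
# snapshot DIR NAME CMD...: run CMD and save its output (stdout and stderr)
# to DIR/NAME-<UTC timestamp>.txt, then print the path of the file written.
snapshot() (
    dir=$1 name=$2
    shift 2
    mkdir -p "$dir" || exit 1
    out="$dir/$name-$(date -u +%Y%m%dT%H%M%SZ).txt"
    "$@" > "$out" 2>&1
    printf '%s\n' "$out"
)
```

For example, `snapshot /root/panic-evidence dmesg dmesg -T` stores the current ring buffer. Copy the resulting file off the host promptly, since an unstable system may not survive long enough to read it later.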
# If in a text console, scroll back to capture the full panic output.
# Use 'dmesg' to see recent kernel messages if the system is still up.
dmesg -T | tail -100
Step 2: Analyze Post-Crash Logs (Journalctl & /var/log)
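After a reboot, the logs from the previous boot are the main source of evidence. The kernel writes its last messages to the ring buffer, and journald or syslog may have saved some of them to disk before the halt. A hard panic can stop the system before the final lines are flushed, so the tail of the log may be incomplete.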
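Once kernel messages are saved to a file, the lines around the panic are usually more informative than the panic line alone. The following is a minimal sketch that prints each panic line with surrounding context; the function name and the default context sizes are assumptions chosen for illustration, and it relies on the `-A`/`-B` options found in GNU and BSD grep.

```shell
# panic_context FILE [BEFORE] [AFTER]: print each "Kernel panic" line in FILE
# with BEFORE lines of leading context (default 40) and AFTER lines of
# trailing context (default 20). Line numbers are included for reference.
panic_context() {
    grep -n -B "${2:-40}" -A "${3:-20}" 'Kernel panic' "$1"
}
```

A typical input is a file produced by something like `sudo journalctl -k -b -1 --no-pager > prev-boot.log`. The function exits non-zero when the file contains no panic line.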
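# Check the systemd journal for kernel messages from the previous boot.
# 'journalctl -k' alone covers only the current boot; '-b -1' selects the one before it (needs persistent journal storage).
sudo journalctl -k -b -1 --no-pager | tail -200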
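# Also check traditional log files (Debian/Ubuntu paths; RHEL/CentOS uses /var/log/messages).
# The match is case-sensitive so that "BUG:" does not also match "debug:".
sudo grep -E "Kernel panic|Oops|BUG:" /var/log/kern.log /var/log/syslog
Step 3: Inspect the Kernel Core Dump (if configured)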
If kdump is enabled, the capture kernel writes a vmcore file when the panic occurs. Analyze it with the crash utility to get a stack trace and examine the kernel state at the moment of the panic.
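Two small helpers can support this step: one checks whether kdump is actually armed, the other prepares a command file so that crash can produce a report without an interactive session. Both are sketches under stated assumptions; the function names and the choice of crash commands are illustrative, while `/proc/cmdline`, `/sys/kernel/kexec_crash_loaded`, and the `-i` option of crash are standard interfaces.

```shell
# has_crashkernel CMDLINE: succeed if the kernel command line reserves
# memory for a capture kernel via a crashkernel= parameter.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# kdump_ready: succeed if memory is reserved and a capture kernel is loaded.
# /sys/kernel/kexec_crash_loaded reads 1 once the kdump service has loaded it.
kdump_ready() {
    has_crashkernel "$(cat /proc/cmdline)" &&
        [ "$(cat /sys/kernel/kexec_crash_loaded 2>/dev/null)" = "1" ]
}

# write_crash_cmds FILE: write a list of crash commands to FILE.
#   sys   - system summary, including the panic string
#   bt    - backtrace of the task that was running at panic time
#   bt -a - backtrace of the active task on every CPU
#   log   - kernel ring buffer as preserved in the dump
#   quit  - leave crash so the run does not wait for input
write_crash_cmds() {
    printf '%s\n' sys bt 'bt -a' log quit > "$1"
}
```

After `write_crash_cmds /tmp/crash.cmds`, a report can be generated with `sudo crash -i /tmp/crash.cmds <vmlinux> <vmcore> > report.txt`, substituting the paths for your system.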
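# Install the crash utility and debug symbols for the kernel that crashed.
# $(uname -r) is correct only if you rebooted into that same kernel version.
# Debug packages come from separate repositories (ddebs on Ubuntu, debuginfo on RHEL/CentOS).
sudo apt install crash linux-image-$(uname -r)-dbgsym # Ubuntu (Debian names the package linux-image-<version>-dbg)
sudo yum install crash kernel-debuginfo-$(uname -r) # RHEL/CentOS
# Analyze the vmcore (RHEL-style vmlinux path shown; Ubuntu installs it as /usr/lib/debug/boot/vmlinux-$(uname -r)).
sudo crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore
# Inside crash, run 'bt' (backtrace) and 'log'.
Step 4: Isolate Hardware vs. Software Cause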
Run hardware diagnostics and review recent software changes to pinpoint the fault domain.
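Kernel logs often contain hardware hints before a panic occurs. The sketch below filters log text for a few message patterns that commonly accompany hardware trouble; the function name is hypothetical and the pattern list is an assumption meant as a starting point, not an exhaustive catalogue.

```shell
# hw_error_lines: read kernel log text on stdin and print lines that
# commonly indicate hardware trouble: machine-check exceptions, EDAC
# (memory controller) reports, and block-layer I/O errors.
# Exits non-zero when nothing matches.
hw_error_lines() {
    grep -i -E 'machine check|hardware error|mce:|edac|i/o error'
}
```

It can be fed from any log source, for example `sudo journalctl -k --no-pager | hw_error_lines`. Matches point toward memory, CPU, or disk faults and justify running the diagnostics in this step first.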
# Run a memory test (requires bootable media like Memtest86+).
# Check disk health.
sudo smartctl -a /dev/sdX
# List recently updated/installed kernel modules and packages.
rpm -qa --last | head -20 # RHEL
grep -E " (install|upgrade) " /var/log/dpkg.log | tail -20 # Debian/Ubuntu (includes upgrades as well as new installs)
Step 5: Boot with a Stable Kernel and Minimal Configuration
Boot from a previous, known-stable kernel in GRUB. If it boots, the issue is with the new kernel or its modules. Use kernel boot parameters to further isolate.
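Before rebooting, it helps to know which kernels are installed as fallbacks. This is a minimal sketch that lists them from the boot directory; the function name is an assumption, and it relies on the version sort (`-V`) of GNU sort and on the common `vmlinuz-<version>` naming of kernel images.

```shell
# list_kernels [BOOTDIR]: print installed kernel versions, newest first,
# based on the vmlinuz-* images found in BOOTDIR (default /boot).
list_kernels() (
    cd "${1:-/boot}" || exit 1
    for image in vmlinuz-*; do
        [ -e "$image" ] || continue     # glob matched nothing
        printf '%s\n' "${image#vmlinuz-}"
    done | sort -V -r
)
```

Compare the output with `uname -r` to see which entry is the running kernel and which older versions remain available in GRUB. On some distributions a rescue image also matches the pattern and will appear in the list.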
# In GRUB, select an older kernel version.
# To troubleshoot, add boot parameters in GRUB ('e' to edit).
# Blacklist suspect drivers (the module names below are examples) and poll for misrouted interrupts:
linux /vmlinuz ... modprobe.blacklist=nouveau,intel_lpss_pci irqpoll
# Boot into rescue mode (single-user, minimal services):
linux /vmlinuz ... systemd.unit=rescue.target
Step 6: Reproduce in a Test Environment (If Possible)
If the cause is suspected to be a kernel bug or specific driver, attempt to replicate the conditions (same kernel version, workload, hardware) on a non-production system.
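A reproduction is only meaningful if the test kernel is configured like the production one. The sketch below compares two kernel config files and prints the options that differ; the function name and the file names in the usage note are assumptions for illustration.

```shell
# config_diff FILE_A FILE_B: print CONFIG_ options that differ between two
# kernel config files. Lines only in FILE_A start at column one; lines only
# in FILE_B are indented with a tab (the output format of 'comm -3').
config_diff() (
    a=$(mktemp) && b=$(mktemp) || exit 1
    trap 'rm -f "$a" "$b"' EXIT
    grep '^CONFIG_' "$1" | LC_ALL=C sort > "$a"
    grep '^CONFIG_' "$2" | LC_ALL=C sort > "$b"
    LC_ALL=C comm -3 "$a" "$b"
)
```

For example, `config_diff prod.config test.config` produces no output when both kernels enable the same options with the same values. Options that are disabled in one file (commented out as "is not set") show up as present only in the other.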
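# Copy the kernel config from the production server.
# /proc/config.gz exists only if the kernel was built with CONFIG_IKCONFIG_PROC; otherwise fall back to the copy in /boot.
zcat /proc/config.gz > .config || cp /boot/config-$(uname -r) .config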
# Build and test the same kernel version in a VM or spare hardware.
# Stress-test specific components (e.g., memory, disk I/O).
stress-ng --vm 2 --vm-bytes 2G --timeout 60s
Architect's Pro Tip
"Configure kdump *before* a panic occurs. Test it works by triggering a panic with 'echo c > /proc/sysrq-trigger' on a test system. A valid vmcore is worth 1000 log entries."
Frequently Asked Questions
What's the difference between a Kernel Panic and an Oops?
An Oops is a non-fatal kernel error where the kernel can often continue running (though possibly unstable). A Panic is a deliberate, unrecoverable halt triggered by a critical Oops or other condition to prevent data corruption.
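Whether an Oops escalates to a panic is controlled by the `kernel.panic_on_oops` sysctl, and `kernel.panic` sets what happens after a panic: zero waits forever, a positive value reboots after that many seconds, a negative value reboots immediately. The sketch below turns the two values into a readable summary; the function name and the wording of its output are assumptions.

```shell
# describe_panic_policy PANIC_ON_OOPS PANIC_TIMEOUT: explain the two values.
describe_panic_policy() {
    if [ "$1" -eq 0 ]; then
        echo "Oops: kernel tries to continue (panic_on_oops=0)"
    else
        echo "Oops: escalates to a panic (panic_on_oops=$1)"
    fi
    if [ "$2" -eq 0 ]; then
        echo "Panic: system halts until reset (panic=0)"
    elif [ "$2" -gt 0 ]; then
        echo "Panic: reboot after $2 seconds (panic=$2)"
    else
        echo "Panic: reboot immediately (panic=$2)"
    fi
}
```

On a live system the arguments can be read with `describe_panic_policy "$(sysctl -n kernel.panic_on_oops)" "$(sysctl -n kernel.panic)"`.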
My server panics randomly once a month. How do I find the cause?
Intermittent panics are often hardware-related (failing RAM, overheating, PSU). Enable persistent, detailed logging (syslog to remote server), configure kdump, and run extended hardware diagnostics (Memtest86+ for 24+ hours, CPU stress tests).
The panic message mentions 'Not syncing' or 'Kernel panic - not syncing'. What does this mean?
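This is the standard prefix of every panic message. "Not syncing" means the kernel will not flush dirty buffers to disk, since writing from a broken state could corrupt data. The reason for the panic follows the colon on the same line (e.g., 'Fatal exception'). When the panic was caused by an Oops, the detailed report is printed just before this line: the specific error type (e.g., 'Unable to handle kernel NULL pointer dereference'), the instruction pointer (RIP on x86_64, EIP on 32-bit x86), and the call trace (stack backtrace). Focus your analysis there.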
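The text after the colon on the panic line names the immediate reason, and extracting it makes it easy to compare crashes across servers or over time. The following is a minimal sketch; the function name is hypothetical.

```shell
# panic_reason: read log text on stdin and print the reason string of each
# "Kernel panic - not syncing: <reason>" line, dropping any log prefix.
panic_reason() {
    sed -n 's/^.*Kernel panic - not syncing: *//p'
}
```

For example, `panic_reason < prev-boot.log | sort | uniq -c` counts how often each reason appears in a saved log.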