Troubleshooting Guide: Diagnosing Linux Kernel Panic on Production Servers

Quick Fix Summary

Boot from a known-good kernel, check system logs, and analyze the panic message for hardware or driver failures.

A kernel panic is an unrecoverable system-level error where the Linux kernel halts to prevent data corruption. It's triggered by critical failures in core kernel code, hardware, or drivers.
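
On a production box you may also want the kernel to reboot automatically a few seconds after a panic rather than hang at the console. A minimal sketch using the standard kernel.panic sysctl (the 10-second value and the sysctl.d file name are arbitrary examples):

bash
# Reboot 10 seconds after a panic instead of hanging (0 disables auto-reboot).
sudo sysctl -w kernel.panic=10
# Persist the setting across reboots (any file under /etc/sysctl.d/ works).
echo "kernel.panic = 10" | sudo tee /etc/sysctl.d/90-panic.conf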

Diagnosis & Causes

  • Faulty or incompatible hardware (RAM, CPU, storage).
  • Buggy or misconfigured kernel modules or drivers.
  • Kernel memory corruption or NULL pointer dereference.
  • File system corruption on critical partitions.
  • Overheating or severe power supply issues.

Recovery Steps

    Step 1: Secure Immediate Logs and System State

    If the system is partially responsive or you can access it via a serial console/KVM, capture the panic screen and any preceding messages. This is your primary evidence.

    bash
    # If in a text console, scroll back to capture the full panic output.
    # Use 'dmesg' to see recent kernel messages if the system is still up.
    dmesg -T | tail -100
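
    If the panic output never reaches disk, a serial console or the kernel's netconsole module can stream the final messages to another host. A minimal sketch, assuming eth0 on the crashing server and a hypothetical collector host; the IPs, ports, and MAC address are placeholders:

    bash
    # Load netconsole to send kernel messages over UDP to a remote collector.
    sudo modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.50/aa:bb:cc:dd:ee:ff
    # On the collector, listen for the messages (netcat syntax varies by variant).
    nc -u -l 6666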

    Step 2: Analyze Post-Crash Logs (Journalctl & /var/log)

    After a reboot, the system logs contain the most critical data. The kernel logs its final moments to the ring buffer, which journald or syslog may have captured.

    bash
    # Check the systemd journal for kernel messages from the boot that crashed.
    # Note: -k alone implies the current boot, so after a reboot add -b -1 for the previous one.
    sudo journalctl -k -b -1 --since "2 hours ago"
    # Also check traditional log files (on RHEL/CentOS the equivalent is /var/log/messages).
    sudo grep -i "panic\|oops\|BUG" /var/log/kern.log /var/log/syslog
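
    Reading the previous boot with -b -1 only works if the journal survives reboots; on systems where /var/log/journal does not exist, journald keeps logs in memory only. A minimal sketch to make it persistent (assuming systemd-journald with the default Storage=auto):

    bash
    # Create the persistent journal directory and restart journald.
    sudo mkdir -p /var/log/journal
    sudo systemctl restart systemd-journald
    # From now on, earlier boots can be listed and inspected after a crash.
    journalctl --list-boots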

    Step 3: Inspect the Kernel Core Dump (if configured)

    If kdump is enabled, a vmcore file is written (to /var/crash by default) when the kernel panics. Analyze it with the crash utility to get a stack trace and examine kernel state at the moment of the panic.

    bash
    # Install the crash utility and matching kernel debug symbols
    # (the -dbgsym / kernel-debuginfo packages come from debug-symbol repos, which may need enabling).
    sudo apt install crash linux-image-$(uname -r)-dbgsym # Debian/Ubuntu
    sudo yum install crash kernel-debuginfo # RHEL/CentOS
    # Analyze the vmcore.
    sudo crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore
    # Inside crash, run 'bt' (backtrace) and 'log'.
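
    If there is no vmcore, first confirm kdump was actually armed at the time of the crash. A quick check; service names differ by distribution (kdumpctl on the RHEL family, kdump-config from Debian/Ubuntu kdump-tools):

    bash
    # 1 means a crash (capture) kernel is loaded and a panic can be dumped.
    cat /sys/kernel/kexec_crash_loaded
    # Confirm the memory reservation for the crash kernel is on the cmdline.
    grep -o "crashkernel=[^ ]*" /proc/cmdline
    # Service status.
    sudo kdumpctl status        # RHEL/CentOS
    sudo kdump-config status    # Debian/Ubuntu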

    Step 4: Isolate Hardware vs. Software Cause

    Run hardware diagnostics and review recent software changes to pinpoint the fault domain.

    bash
    # Run a memory test (requires bootable media like Memtest86+).
    # Check disk health.
    sudo smartctl -a /dev/sdX
    # List recently updated/installed kernel modules and packages.
    rpm -qa --last | head -20 # RHEL
    grep "install " /var/log/dpkg.log | tail -20 # Debian/Ubuntu

    Step 5: Boot with a Stable Kernel and Minimal Configuration

    Boot from a previous, known-stable kernel in GRUB. If it boots, the issue is with the new kernel or its modules. Use kernel boot parameters to further isolate.

    bash
    # In GRUB, select an older kernel version.
    # To troubleshoot, add boot parameters in GRUB ('e' to edit).
    # Disable problematic hardware/drivers:
    linux /vmlinuz ... modprobe.blacklist=nouveau,intel_lpss_pci irqpoll
    # Use a minimal runlevel:
    linux /vmlinuz ... systemd.unit=rescue.target
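
    If the older kernel proves stable, pin it as the default so an unattended reboot does not land back on the broken one. A sketch; the entry titles, indexes, and kernel path are placeholders:

    bash
    # RHEL/CentOS: list boot entries, then set a known-good kernel as default.
    sudo grubby --info=ALL | grep -E "^index|^title"
    sudo grubby --set-default /boot/vmlinuz-<known-good-version>
    # Debian/Ubuntu: with GRUB_DEFAULT=saved in /etc/default/grub:
    sudo grub-set-default "<menu entry title>"
    sudo update-grub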

    Step 6: Reproduce in a Test Environment (If Possible)

    If the cause is suspected to be a kernel bug or specific driver, attempt to replicate the conditions (same kernel version, workload, hardware) on a non-production system.

    bash
    # Clone the kernel config from the production server
    # (if /proc/config.gz is absent, copy /boot/config-$(uname -r) instead).
    zcat /proc/config.gz > .config
    # Build and test the same kernel version in a VM or spare hardware.
    # Stress-test specific components (e.g., memory, disk I/O).
    stress-ng --vm 2 --vm-bytes 2G --timeout 60s
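
    Targeting the suspected subsystem usually reproduces the fault faster than general load; the flags below are illustrative values, not tuned numbers. Watch the kernel ring buffer in a second terminal while the test runs:

    bash
    # Hammer disk I/O or CPU specifically (adjust workers/sizes to the host).
    stress-ng --hdd 4 --hdd-bytes 1G --timeout 10m --metrics-brief
    stress-ng --cpu 0 --cpu-method matrixprod --timeout 10m
    # Follow kernel messages live in another terminal.
    sudo dmesg -wT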

    Architect's Pro Tip

    "Configure kdump *before* a panic occurs. Test it works by triggering a panic with 'echo c > /proc/sysrq-trigger' on a test system. A valid vmcore is worth 1000 log entries."

    Frequently Asked Questions

    What's the difference between a Kernel Panic and an Oops?

    An Oops is a non-fatal kernel error where the kernel can often continue running (though possibly unstable). A Panic is a deliberate, unrecoverable halt triggered by a critical Oops or other condition to prevent data corruption.
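
    Whether an Oops escalates to a panic is controlled by a sysctl; many production sites set it to 1 so every Oops produces a vmcore instead of a silently degraded kernel. A sketch:

    bash
    # 1 = panic (and dump, if kdump is armed) on every Oops; 0 = try to continue.
    cat /proc/sys/kernel/panic_on_oops
    sudo sysctl -w kernel.panic_on_oops=1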

    My server panics randomly once a month. How do I find the cause?

    Intermittent panics are often hardware-related (failing RAM, overheating, a failing PSU). Enable persistent, detailed logging (forward syslog to a remote server), configure kdump, and run extended hardware diagnostics (Memtest86+ for 24+ hours, CPU stress tests).
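
    Forwarding logs off-box preserves the final messages before a crash even if the local disk never receives them. A minimal rsyslog sketch; the file name and log host are placeholders (@@ selects TCP, a single @ would be UDP):

    bash
    # Forward all messages to a central log host over TCP.
    echo '*.* @@loghost.example.com:514' | sudo tee /etc/rsyslog.d/90-remote.conf
    sudo systemctl restart rsyslog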

    The panic message mentions 'Not syncing' or 'Kernel panic - not syncing'. What does this mean?

    This is the standard panic preamble. The key diagnostic information follows this line: the specific error type (e.g., 'Unable to handle kernel NULL pointer dereference'), the instruction pointer (RIP/EIP), and the call trace (stack backtrace). Focus your analysis there.
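
    With a persistent journal, the full panic block from the previous boot can usually be pulled out directly; adjust the context lines as needed:

    bash
    # Grab the panic line plus the call trace that follows it.
    sudo journalctl -k -b -1 | grep -i -B 5 -A 40 "kernel panic - not syncing"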
