
Troubleshooting Guide: Diagnosing Linux Kernel Panic on Production Servers

Quick Fix Summary (TL;DR)

Boot from a known-good kernel, check system logs, and analyze the panic message for hardware or driver failures.

A Kernel Panic is an unrecoverable system-level error where the Linux kernel halts to prevent data corruption. It's triggered by critical failures in kernel code, hardware, or drivers that compromise system integrity.

Diagnosis & Causes

  • Faulty or incompatible hardware (RAM, CPU, storage).
  • Buggy or misconfigured kernel modules or drivers.
  • Corrupted filesystem or disk errors.
  • Kernel bugs or incompatible kernel updates.
  • Overheating or insufficient power supply.
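
A quick triage pass can rule several of these causes in or out before deeper analysis. A minimal sketch (lm-sensors is assumed to be installed for the last command):

bash
# Machine-check and thermal events reported by the kernel
dmesg | grep -iE "mce|hardware error|thermal"
# Filesystem and disk I/O errors
dmesg | grep -iE "ext4-fs|xfs|i/o error"
# Current temperatures and fan speeds
sensors
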
Recovery Steps


    Step 1: Secure Immediate Evidence from the Console

    If the server is accessible, photograph or transcribe the entire panic screen. The call trace and register dump are critical for diagnosis.

    bash
    # If the system is still running but unstable, force a panic to capture a trace or
    # crash dump (USE WITH EXTREME CAUTION; this destroys all unsaved state)
    echo 1 > /proc/sys/kernel/sysrq   # the magic SysRq interface must be enabled first
    echo c > /proc/sysrq-trigger
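
    If the machine is still up but another panic is expected, netconsole can stream kernel messages, including the panic text itself, to a second host over UDP. A minimal sketch, assuming placeholder addresses (192.168.1.20 local, 192.168.1.10 collector) and interface eth0:

    bash
    # Stream kernel messages to a remote listener (all addresses are placeholders)
    modprobe netconsole netconsole=6665@192.168.1.20/eth0,6666@192.168.1.10/
    # On the collector host, capture the stream with e.g.: nc -u -l 6666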

    Step 2: Boot into a Rescue Environment & Collect Logs

    Boot from a live USB/DVD or a known-good kernel. Mount the root filesystem and extract all relevant logs from the failed boot.

    bash
    # Mount the root partition from the rescue environment (replace sdX1 with the real root device)
    mount /dev/sdX1 /mnt
    # Create a working directory and copy critical logs for analysis
    mkdir -p /root/panic_analysis
    cp /mnt/var/log/kern.log* /mnt/var/log/dmesg* /mnt/var/log/syslog* /root/panic_analysis/
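
    If the root filesystem sits on LVM, the volume group must be activated before mounting; on systemd distributions the on-disk journal can also be read directly from the rescue environment. A sketch, assuming a volume group/logical volume named vg0/root:

    bash
    # Activate LVM volume groups, then mount the root LV (vg0/root is an assumption)
    vgchange -ay
    mount /dev/vg0/root /mnt
    # Kernel messages from the failed boot, straight from the persistent journal
    journalctl --directory=/mnt/var/log/journal -b -1 -k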

    Step 3: Analyze Kernel Logs for Oops and Panic Context

    Search the logs for 'Oops', 'panic', 'BUG', and the call trace. The lines immediately before the panic message often identify the culprit.

    bash
    grep -B 20 -A 5 "Kernel panic" /var/log/kern.log
    grep -B 10 "Oops" /var/log/kern.log
    # Note: dmesg only shows the current boot's ring buffer, not the failed boot's
    dmesg | tail -100
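
    On systemd-based distributions /var/log/kern.log may not exist at all; if persistent journaling is enabled, the same search can run against the journal of the boot that panicked:

    bash
    # Kernel messages from the previous boot, filtered for panic indicators
    journalctl -k -b -1 --no-pager | grep -B 20 -A 5 -iE "kernel panic|oops|bug:"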

    Step 4: Isolate the Faulty Component via Call Trace

    Decode the call trace (the RIP/EIP instruction pointer and the function names) to determine whether the failure lies in a specific driver (e.g., nvidia, e1000) or in core kernel code.

    bash
    # Example: look for module names in square brackets at the end of each trace line.
    # This entry points to the 'nv' driver:
    #   Call Trace:
    #    [  123.456]  [<ffffffffa0123456>] ? nv_ioctl+0x123/0x456 [nv]
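
    With matching kernel sources and a debug-symbol vmlinux available, the kernel's own decode_stacktrace.sh turns raw trace addresses into file:line references. A sketch, assuming the trace was saved to /root/panic_analysis/trace.txt (both paths are assumptions):

    bash
    # Resolve call-trace addresses to source lines
    cd /usr/src/linux
    ./scripts/decode_stacktrace.sh vmlinux < /root/panic_analysis/trace.txt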

    Step 5: Perform Hardware Diagnostics

    Rule out hardware failure, which is a common root cause. Test memory and CPU thoroughly.

    bash
    # Test 2 GB of memory for 2 passes (the size must fit in free RAM)
    memtester 2G 2
    # Query the mcelog daemon for logged CPU machine-check errors
    mcelog --client
    # Check disk health via SMART attributes
    smartctl -a /dev/sda
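
    To cover the overheating and power-supply angle, load the machine while watching temperatures and error counters. A sketch, assuming stress-ng and lm-sensors are installed:

    bash
    # Load all CPUs for 10 minutes
    stress-ng --cpu 0 --timeout 10m
    # In a second shell: refresh temperature readings every 2 seconds
    watch -n 2 sensors
    # ECC/memory-controller error counts, on hardware with EDAC support
    edac-util -v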

    Step 6: Implement a Mitigation and Restore Service

    Based on analysis, blacklist a faulty module, revert a kernel update, or schedule hardware replacement. Boot with minimal modules.

    bash
    # Blacklist the offending module (replace faulty_module with the real name)
    echo "blacklist faulty_module" >> /etc/modprobe.d/blacklist.conf
    # Rebuild the initramfs so the blacklist applies at early boot, then reboot
    update-initramfs -u -k all   # on RHEL-family systems: dracut -f --regenerate-all
    reboot
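
    If the panic started right after a kernel update, booting the previous kernel is usually the fastest mitigation. On GRUB2 systems grub-reboot selects an entry for the next boot only, so a wrong choice self-heals. The menu-entry title below is an assumption, so list yours first:

    bash
    # List boot entries (grub.cfg path varies by distribution)
    grep -E "(menuentry|submenu) '" /boot/grub/grub.cfg | cut -d"'" -f2
    # One-time boot of a known-good kernel (requires GRUB_DEFAULT=saved)
    grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 5.15.0-91-generic"
    reboot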

    Step 7: Configure Persistent Crash Dumping (Kdump)

    For future panics, configure Kdump to capture a full kernel memory dump (vmcore) to disk for offline analysis with the 'crash' utility.

    bash
    # Install kdump tooling (Debian/Ubuntu or RHEL-family)
    apt install kdump-tools || yum install kexec-tools
    # Reserve crash-kernel memory: append this to GRUB_CMDLINE_LINUX in /etc/default/grub
    GRUB_CMDLINE_LINUX="crashkernel=256M"
    # Regenerate the GRUB config, then enable and start the service
    update-grub || grub2-mkconfig -o /boot/grub2/grub.cfg
    systemctl enable kdump.service
    systemctl start kdump.service
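
    After enabling kdump, verify the pipeline end to end in a maintenance window by forcing a test panic and opening the resulting vmcore. The dump path and debug-symbol location below follow Debian conventions and are assumptions:

    bash
    # Confirm the crash kernel is loaded (Debian/Ubuntu; on RHEL: kdumpctl status)
    kdump-config show
    # Force a test panic; the machine reboots and writes a vmcore under /var/crash
    echo c > /proc/sysrq-trigger
    # After reboot, analyze the dump with the matching debug vmlinux
    crash /usr/lib/debug/boot/vmlinux-$(uname -r) /var/crash/*/dump.*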

    Architect's Pro Tip

    "Panics often occur minutes after the real fault. Correlate timestamps with systemd journal logs (`journalctl -S -1hour`) to find the triggering service or hardware event."

    Frequently Asked Questions

    What's the difference between an 'Oops' and a 'Kernel Panic'?

    An 'Oops' is a non-fatal kernel error: the kernel kills the offending task and can often keep running, though its internal state may already be corrupted. A 'Panic' is a deliberate, unrecoverable halt that stops an irreparable error from corrupting filesystems or data.
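
    Many production operators prefer a clean panic-and-reboot over running on after an Oops, precisely because the kernel's state may already be corrupted. A common hardening sketch (the sysctl.d filename is an assumption):

    bash
    # Turn every Oops into a panic, and auto-reboot 10 seconds after a panic
    sysctl -w kernel.panic_on_oops=1
    sysctl -w kernel.panic=10
    # Persist the settings across reboots
    printf "kernel.panic_on_oops = 1\nkernel.panic = 10\n" >> /etc/sysctl.d/90-panic.conf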

    The server is completely unresponsive after a panic. How do I get the logs?

    There are three options: 1) a physical/IPMI console screenshot, 2) serial console output, if it was configured, 3) a kdump vmcore, if it was set up before the crash. Without any of these, the on-screen panic message is your only evidence.
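
    Option 2 only works if a serial console was configured before the crash. A minimal GRUB sketch; ttyS0 at 115200 baud is an assumption, so match your hardware or IPMI Serial-over-LAN settings:

    bash
    # In /etc/default/grub: mirror kernel output to serial and the local console
    GRUB_CMDLINE_LINUX="console=ttyS0,115200n8 console=tty0"
    GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200"
    GRUB_TERMINAL="serial console"
    update-grub   # on RHEL-family: grub2-mkconfig -o /boot/grub2/grub.cfg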

    Should I always update the kernel after a panic?

    Not immediately. First, diagnose. If the trace points to a known bug fixed in a later stable kernel, then update. Blindly updating can introduce new incompatibilities. Reverting to the last-known-good kernel is a safer first step.
