Troubleshooting Proxmox crash loops

I end up chasing Proxmox crash loops more often than I’d like. They hide the real fault and waste time. The useful part is that the journal from the boot before the crash usually tells you where to start.

What it looks like

The host boots, runs for a while, then reboots without a clean shutdown. The cycle repeats until you stop it. VMs can die mid-boot. Sometimes the screen goes black for a few seconds before the restart. Sometimes it reboots straight after a kernel message.

Common error messages

Watch the exact lines. These are the ones I see most often:

  • “Kernel panic - not syncing: fatal exception”
  • “watchdog: BUG: soft lockup - CPU#X stuck for 22s!”
  • “ACPI Error: [xxx] Namespace lookup failure”
  • “Rebooting in 1 seconds..” followed by nothing useful

Copy the exact text into your notes. Those lines usually point to kernel, ACPI, watchdog, or hardware trouble.
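A quick way to pull those lines out of the previous boot in one go (the pattern list is just my usual starting set, extend it as needed):

journalctl -k -b -1 | grep -iE 'panic|oops|soft lockup|acpi error|mce'

If that returns nothing, the crash may have happened faster than the journal could flush, which is a data point in itself.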

Logs to pull first

Grab these straight after a crash:

journalctl -b -1 -o short-iso | tac | sed -n '1,200p'   # previous boot, last 200 lines, newest first
journalctl -k -b -1 | tail -n 200                       # kernel messages from the previous boot
dmesg -T | tail -n 200                                  # current kernel ring buffer with readable timestamps
tail -n 200 /var/log/syslog                             # classic syslog, in case journald lost lines
pveversion -v                                           # Proxmox packages and running kernel

For a clean shutdown, I expect systemd to log “Reached target Shutdown”. If the log stops after a fatal kernel message, the crash is usually kernel- or hardware-related.
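To check for the shutdown marker directly, something like this works (the exact target name varies a little between systemd versions, so I match loosely):

journalctl -b -1 | grep -iE 'reached target.*(shutdown|power-off)'

No match on a host that rebooted is a strong hint the shutdown was not clean.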

Where to look next

Hardware faults show up as restart loops far too often. PSU voltage sag, failing motherboard capacitors, overheating CPU, and bad RAM can all look the same from the front. I test the PSU and RAM before I start blaming Proxmox.

Kernel changes matter too. Different Proxmox kernels can behave differently with the same hardware. A kernel update can expose a problem that was already there. Check pveversion -v and note the kernel version. If the crashes started after an upgrade, boot an older kernel from GRUB and see whether the host settles down.
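On recent Proxmox versions, proxmox-boot-tool can list and pin kernels, which saves picking the entry by hand at every boot. A sketch, assuming the tool is managing your boot entries; the version string is an example, use one from the list:

proxmox-boot-tool kernel list                # show installed kernels
proxmox-boot-tool kernel pin 6.2.16-20-pve   # boot this kernel by default
proxmox-boot-tool kernel unpin               # return to the newest kernel later

If the host is stable on the pinned kernel, you have a regression to report.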

Ambient temperature, power stability, and extra peripherals can all push a marginal system over the edge. A USB device, PCIe NIC, or failing HDD can trigger interrupts and lead to a panic. I also note whether the lab power strip shares a circuit with heavier appliances. Power noise tends to show up as sudden restarts with no useful panic output.

Commands that help

These are the checks I run while collecting evidence:

pveversion -v                             # record package and kernel versions
dmidecode -t memory; lshw -class memory   # installed DIMMs, slots, and speeds
sensors                                   # temperatures and voltages (lm-sensors)
smartctl -a /dev/sda                      # drive health; repeat per drive
memtester 1024 5                          # test 1024 MB of free RAM, 5 loops
stress-ng --cpu 4 --timeout 300s          # five-minute CPU burn to provoke heat or PSU faults

sensors should show sensible idle temperatures. If temperatures spike or the tool errors out, I start looking at cooling or BIOS settings.
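The kernel also logs thermal events, so I cross-check sensors against the previous boot’s journal (the pattern list is my own habit, not exhaustive):

journalctl -k -b -1 | grep -iE 'thermal|throttl|temperature'

Throttling messages shortly before the reboot point firmly at cooling.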

Reading the logs

Time matters. If kernel messages spike a few seconds before the reboot, that is a useful clue. Kernel OOPS, watchdog, and ACPI errors are the lines I care about. journalctl --since "YYYY-MM-DD HH:MM:SS" helps narrow the crash window. If the logs end abruptly, that usually looks more like power loss or a CPU fault than a clean kernel panic.
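A sketch of how I narrow the window, assuming the crash happened around 14:32 (the timestamps are placeholders):

journalctl -b -1 --since "2024-05-01 14:30:00" --until "2024-05-01 14:33:00" -p warning

The -p warning filter drops routine chatter and keeps only warnings and errors.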

Common hardware faults

  • RAM errors: single-bit flips or ECC reports (a quick check follows this list).
  • PSU instability: voltage drops under load.
  • Overheating: CPU or VRM temps hit limits.
  • Bad peripherals: faulty NICs, USB cards, or NVMe drives.
  • BIOS bugs: ACPI table issues, odd CPU microcode handling.
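For the ECC reports in the first bullet, rasdaemon keeps a running count if it is installed (ras-mc-ctl ships with it); otherwise grep the kernel log:

ras-mc-ctl --error-count          # per-DIMM corrected/uncorrected counts (needs rasdaemon)
journalctl -k -b -1 | grep -iE 'edac|ecc|mce'

Any uncorrected count above zero makes the RAM or the slot suspect.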

Booting Memtest86 from USB and moving RAM sticks between slots are the quickest ways to confirm memory trouble. If you can, borrow a PSU and swap it in. I also try to reproduce the crash with the minimum hardware: one stick of RAM and no extra cards.
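Writing the Memtest86 image to a stick is just dd; the image filename and target device below are placeholders for your download and your USB stick:

dd if=memtest86-usb.img of=/dev/sdX bs=1M status=progress && sync

Double-check /dev/sdX with lsblk first; dd to the wrong device destroys it.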

Fix order

  1. Reproduce with minimal hardware.
  2. Run Memtest86 for multiple passes.
  3. Boot an older kernel from GRUB.
  4. Update BIOS and firmware if there is a known issue.
  5. Replace the suspect PSU or RAM.
  6. Try a live USB boot to rule out filesystem or install corruption.

Step-by-step fixes

  • Isolate RAM: remove all sticks except one, then run Memtest86. If errors appear, swap the stick or replace it.
  • Swap the PSU: if you have a spare, fit it and run a stress test (see the sketch after this list). If the crashes stop, replace the original PSU.
  • Roll back the kernel: at GRUB, choose an older kernel. If the host stabilises, keep the older kernel and report the regression to Proxmox with your logs.
  • Update the BIOS: download the exact BIOS for the motherboard, flash it from USB using the vendor steps, and record the previous version.
  • Remove non-essential devices: take out add-in cards and external USB devices, reboot, then add them back one at a time.
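For the stress test in the PSU step, I load CPU and memory together, since pure CPU load alone may not sag a marginal rail. The flags below are standard stress-ng options; scale --cpu to your core count:

stress-ng --cpu 4 --vm 2 --vm-bytes 75% --timeout 1h --metrics-brief

If the host survives an hour of that on the spare PSU but crashed on the original, the PSU is your answer.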

After the fix

Once the host is stable, I leave it under normal load for 24 to 72 hours. A simple watch is enough:

journalctl -f | tee /var/log/live-journal.log   # mirror the live journal to a file
watch -n 5 sensors                              # temperatures every five seconds
ping -i 0.2 8.8.8.8                             # marks the exact second the host drops

If it crashes again, live-journal.log should hold the last useful lines before the reboot, though a hard power cut can still lose the final buffered writes.

Follow-up checks

Run pveversion -v again to confirm the kernel has not changed. Rerun Memtest86 monthly if RAM still looks suspect, and run SMART tests every few months. If you updated the BIOS, note any new vendor settings that might affect ACPI.
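For the periodic SMART tests, a long self-test plus a results check covers it; /dev/sda is a placeholder for each drive:

smartctl -t long /dev/sda       # start the long self-test (runs in the background)
smartctl -l selftest /dev/sda   # check the result once the test has finished

smartctl prints the expected test duration when it starts; check back after that.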

Keep a short record

  • Exact error lines.
  • Timestamps of crashes.
  • pveversion -v output.
  • Hardware changes and firmware versions.
  • Test results and pass or fail outcomes.

A short incident note is enough. It helps when the crash comes back and when you need to report it to Proxmox or the motherboard vendor.

Start with the exact log lines and timestamps, then strip the system back to the minimum hardware. RAM and power are the usual culprits. An older kernel is worth checking if the failures started after an update.
