
Resolving restart loops in Proxmox environments

Troubleshooting Proxmox Crashes: A Practical Guide for Your Home Lab

I troubleshoot Proxmox crashes in my home lab a lot. Restart loops are the worst, because they mask the real error and waste time. This guide walks through what to look for, how to find the cause, and practical fixes you can apply straight away. No theory, just commands, logs and steps that change the result.

What you see

Symptoms of restart loops

The host boots, runs for a while, then reboots without a clean shutdown. The cycle repeats until you intervene. VMs may die mid-boot. Sometimes the system drops to a black screen for a few seconds before restarting. Other times it reboots immediately after a kernel message.

Common error messages

Watch for exact lines. Examples I see often:

  • "Kernel panic - not syncing: fatal exception"
  • "watchdog: BUG: soft lockup - CPU#X stuck for 22s"
  • "ACPI Error: [xxx] Namespace lookup failure"
  • "Rebooting in 1 seconds.." followed by no other useful trace

Copy the exact lines into your notes. Exact messages point to kernel, ACPI or watchdog problems.

Logs to check

Collect these immediately after a crash:

  • journalctl -b -1 -o short-iso | tac | sed -n '1,200p' — last boot log, newest first
  • journalctl -k -b -1 | tail -n 200 — kernel messages from last boot
  • dmesg -T | tail -n 200 — kernel ring, timestamped
  • tail -n 200 /var/log/syslog — Debian/Proxmox syslog
  • pveversion -v — Proxmox and kernel versions

Expected vs actual: for a clean shutdown you expect systemd to log “Reached target Shutdown”. If you see nothing after a fatal kernel message, the crash is kernel or hardware triggered.
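The checklist above can be wrapped in a small collection script so nothing gets missed right after a crash. This is a minimal sketch; the output directory and file names are my own convention, and each collector is guarded so the script still runs on a partial system:

```shell
#!/bin/sh
# Collect post-crash evidence into one timestamped directory.
set -eu

# Output directory: first argument, or a timestamped dir in the current path.
OUT="${1:-./crash-$(date '+%Y%m%d-%H%M%S')}"
mkdir -p "$OUT"

# Last boot's full journal and kernel-only messages (ignore errors if
# persistent journald logging is not enabled).
journalctl -b -1 -o short-iso > "$OUT/journal-last-boot.log" 2>&1 || true
journalctl -k -b -1 > "$OUT/kernel-last-boot.log" 2>&1 || true

# Current kernel ring buffer tail and version info.
( dmesg -T 2>/dev/null || true ) | tail -n 200 > "$OUT/dmesg-tail.log"
pveversion -v > "$OUT/pveversion.txt" 2>&1 || true

echo "Evidence collected in $OUT"
```

Run it once per crash and keep the directories; comparing two crash dumps side by side often reveals the pattern faster than one in isolation.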

Where it happens

Hardware dependencies

Restart loops often tie to hardware faults. PSU voltage sag, failing caps on the motherboard, overheating CPU, or dodgy RAM can produce identical symptoms. I always test the PSU and RAM before chasing software.

Proxmox version specifics

Different Proxmox kernels interact with drivers differently. A kernel update can expose latent hardware bugs. Run pveversion -v and note the kernel version. If crashes started after an upgrade, boot an older kernel from GRUB to test if the kernel is the trigger.

Environmental factors

Check ambient temperature, power stability and added peripherals. Small changes matter. A USB device, PCIe NIC or failed HDD can throw interrupts that escalate to a panic. Note if the lab power strip shares a circuit with heavy appliances. Power noise shows as sudden restarts with no kernel panic output.

Find the cause

Diagnostic commands

Run these to collect evidence:

  1. pveversion -v — record Proxmox and kernel.
  2. dmidecode -t memory; lshw -class memory — inspect RAM details.
  3. sensors — check CPU and board temps.
  4. smartctl -a /dev/sda — drive health.
  5. memtester 1024 5 — test RAM from running system (not a substitute for memtest86).
  6. stress-ng --cpu 4 --timeout 300s — reproduce under load.

Expected: sensors shows sensible temps under idle. Actual: if temps spike or sensors return errors, suspect cooling or BIOS.

Analyzing logs

Look at timestamps. Does a spike in kernel messages precede the reboot by seconds? Kernel OOPS, watchdog or ACPI errors are key. Use journalctl --since "YYYY-MM-DD HH:MM:SS" to home in on the crash window. If logs end abruptly, that usually means an immediate power loss or CPU fault, not a graceful kernel panic.

Common hardware failures

I usually find one of these:

  • RAM errors: single-bit flips or ECC reports.
  • PSU instability: voltage drops under load.
  • Overheating: CPU or VRM temps hit limits.
  • Bad peripherals: faulty NICs, USB cards, or NVMe drives.
  • BIOS bugs: ACPI table issues, odd CPU microcode handling.

Memtest86 from USB and swapping RAM sticks into different slots is the fast way to confirm RAM. Swap the PSU if you can borrow one. Reproduce the crash with minimal hardware: one stick of RAM, no extra cards.

Fix

Recommended solutions

Fix order I use:

  1. Reproduce with minimal hardware.
  2. Run memtest86 for multiple passes.
  3. Boot an older kernel from GRUB.
  4. Update BIOS and firmware if a known issue exists.
  5. Replace suspect PSU or RAM.
  6. Try a live USB boot to rule out filesystem or install corruption.

Step-by-step fixes

  • Isolate RAM: remove all sticks except one, run memtest86. If errors appear, change the stick or replace.
  • Swap PSU: if a spare is available, substitute and run a stress test. If crashes stop, replace the original PSU.
  • Kernel rollback: at GRUB, choose an older kernel. If the host stabilises, hold the older kernel and report a regression to Proxmox with your logs.
  • BIOS update: download the exact BIOS for your motherboard, flash via USB following vendor steps. Record the previous BIOS version.
  • Disable non-essential devices: remove add-in cards and external USB devices, reboot, then reintroduce one by one.
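For the kernel-rollback step, recent Proxmox VE releases (7 and later) ship proxmox-boot-tool, which can pin a known-good kernel so the host keeps booting it across updates. A sketch; the version string is an example, substitute one from your own list output:

```shell
# Pin a known-good kernel so the host keeps booting it after updates.
if command -v proxmox-boot-tool >/dev/null 2>&1; then
    # Show installed kernels and the current default.
    proxmox-boot-tool kernel list
    # Example version string: use one from the list output above.
    proxmox-boot-tool kernel pin 6.5.13-6-pve
    # Later, to return to the newest kernel:
    # proxmox-boot-tool kernel unpin
else
    echo "proxmox-boot-tool not found (not a Proxmox VE 7+ host?)"
fi
```

Pinning beats selecting the old kernel in GRUB by hand each boot, and it is easy to undo once a fixed kernel lands.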

Additional testing

After fixes, run stress tests for hours:

  • memtest86 for multiple passes.
  • stress-ng with CPU, memory and IO combined for several hours.
  • smartctl -t long /dev/sdX and check results.

If an NVMe or SATA drive is in doubt, try booting from another drive or live USB to isolate the storage.
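The long SMART self-test runs in the drive's background, so I poll for completion instead of guessing when it is done. A sketch; the device path is an example, pick yours from lsblk first:

```shell
# Start a long SMART self-test and poll until the drive reports a result.
DEV="/dev/sda"   # example device: confirm with lsblk before running
if command -v smartctl >/dev/null 2>&1; then
    smartctl -t long "$DEV"
    # A long test can take hours; check every 10 minutes.
    while smartctl -a "$DEV" | grep -q 'in progress'; do
        sleep 600
    done
    # Print the self-test result log once the run is complete.
    smartctl -l selftest "$DEV"
else
    echo "smartctl not installed (apt install smartmontools)"
fi
```

Anything other than "Completed without error" in the result table justifies replacing the drive before continuing the software hunt.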

Check it’s fixed

Monitoring stability

After repair, monitor for at least 24–72 hours under normal load. Set up a simple monitoring loop:

  • journalctl -f | tee /var/log/live-journal.log
  • watch -n 5 sensors
  • ping -i 0.2 8.8.8.8 | tee /var/log/ping-watch.log — run a continuous ping and check the log for drops

If a crash happens, the live-journal.log will hold the immediate pre-crash lines for analysis.

Follow-up checks

Re-run pveversion -v to confirm kernel unchanged. Re-run memtest86 monthly if suspect RAM, and run smartctl tests quarterly. If BIOS was updated, note any new options enabled by the vendor that could affect ACPI.

Documenting the process

Record:

  • exact error lines.
  • timestamps of crashes.
  • pveversion -v output.
  • hardware changes and firmware versions.
  • test results and pass/fail outcomes.

Keep this as a short incident note. It helps if crashes return and when reporting bugs to Proxmox or motherboard vendors.

Final takeaways: collect exact log lines and timestamps first. Isolate hardware before chasing configuration tweaks. Swap power and memory early. Boot an older kernel to check for regressions. These steps turn vague Proxmox crashes into a clear root cause and a tested fix.
