
Mitigating VM downtime post Proxmox host migration

I had a Proxmox VM migration leave services unreachable for a few minutes. The VMs had moved hosts cleanly, but traffic still went to the old switch port. That delay usually points at the physical switch holding old MAC or ARP entries. The steps below are practical checks and fixes I use when a Proxmox VM migration breaks VM accessibility.

What you see

Symptoms of VM inaccessibility

  • VM pings fail immediately after migration.
  • SSH or application ports do not respond for a short while.
  • ARP shows the wrong MAC for the VM IP on the gateway or host.

Common error messages

  • ping: sendmsg: Operation not permitted
  • ping: transmit failed. General failure
  • ping: sendto: Host is unreachable
  • bridge or kernel logs showing unknown neighbour or stale entries.

Timeframes for downtime

  • The gap is usually seconds to several minutes.
  • If the switch MAC/ARP aging is long, downtime matches that timer.
  • Cheap or unmanaged switches can take longer to relearn MAC addresses.

Where it happens

Network switch involvement

  • Physical switches learn MACs per port. After a Proxmox VM migration, the VM’s MAC moves to a different host and port. If the switch still maps that MAC to the old port, traffic goes the wrong way.
  • Managed switches let you view and clear MAC and ARP tables. Unmanaged switches do not, and they can be slow to adapt.

Impact of ARP table updates

  • Hosts and routers cache IP-to-MAC mappings in ARP. If the ARP entry points at the old MAC or is missing, traffic fails until ARP is refreshed.
  • Linux hosts also cache neighbours. The command ip neigh shows the kernel ARP/NDP cache; a short sketch for watching it change in real time follows this list.
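
If you want to watch the cache react while a migration is in progress, iproute2 can print neighbour events as they happen. A minimal sketch; run it on the gateway or on another host that talks to the VM:

    # Print neighbour (ARP/NDP) cache changes live during the migration
    ip monitor neigh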

VM accessibility issues

  • On the destination host the VM is running fine locally. Networking fails because upstream devices still send frames to the previous port.
  • That makes it look like migration broke the VM. In reality the VM is running; the network path to it is stale. The capture sketch after this list makes that visible.
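
One way to confirm this split is to capture on the destination host's bridge while pinging the VM from elsewhere. A minimal sketch, assuming the bridge is vmbr0 and the VM's MAC is the placeholder aa:bb:cc:dd:ee:ff used throughout this post:

    # On the destination Proxmox host: capture frames to and from the VM's MAC
    tcpdump -ni vmbr0 ether host aa:bb:cc:dd:ee:ff

If the upstream switch still forwards to the old port, you see the VM's outbound frames here but little or no inbound traffic from the gateway.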

Find the cause

Diagnosing switch configurations

  • On a managed switch, list MAC table entries. Cisco example (IOS displays MACs in dotted form, so use aabb.ccdd.eeff rather than aa:bb:cc:dd:ee:ff):

    show mac address-table address aabb.ccdd.eeff

    Expected: MAC mapped to the new host port. Actual problem: MAC points to the old port.

  • For ARP on an IP router:

    show ip arp | include 192.0.2.10

    Expected: IP mapped to MAC of destination host. Actual: old MAC or no entry.

Checking ARP and MAC table settings

  • On Linux, check ARP/neighbor entries:

    ip neigh show | grep 192.0.2.10

    Example outputs:

    • Good: 192.0.2.10 dev vmbr0 lladdr aa:bb:cc:dd:ee:ff REACHABLE
    • Bad: 192.0.2.10 dev vmbr0 INCOMPLETE
    • Stale: 192.0.2.10 dev vmbr0 lladdr aa:bb:cc:dd:ee:ff STALE
  • Check bridge FDB on the host:

    bridge fdb show

    Expected: the VM MAC is listed on the host where the VM now runs (see the example entry below). Actual: MAC on the other host or missing.
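
For reference, a healthy entry looks roughly like the line below. The interface name tap100i0 is a hypothetical example for VM 100's first NIC; your tap or port names will differ:

    bridge fdb show | grep -i aa:bb:cc:dd:ee:ff
    # aa:bb:cc:dd:ee:ff dev tap100i0 master vmbr0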

Identifying host migration problems

  • Confirm the VM’s MAC after migration. On Proxmox, qm config lists the VM’s virtual NICs and their MACs (virsh belongs to libvirt and is not normally installed on Proxmox):

    qm config <vmid> | grep ^net
    ip link show dev vmbr0   # confirm the bridge itself is up

  • Confirm the service inside the VM is listening. From the VM’s console on the destination host:

    ss -tlnp | grep :22

  • If the VM listens locally but the network path fails, the problem is external to Proxmox. That points at switch MAC learning or ARP caching; the triage sketch below runs these checks together.
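
To localize the failure quickly, the checks above can be strung together. A rough sketch, assuming the example IP 192.0.2.10, MAC aa:bb:cc:dd:ee:ff, and bridge vmbr0; adjust to your environment:

    #!/bin/sh
    # Quick triage on the destination host: local state first, then reachability.
    VM_IP=192.0.2.10
    VM_MAC=aa:bb:cc:dd:ee:ff

    echo "--- bridge FDB entry for the VM MAC (should be on this host) ---"
    bridge fdb show | grep -i "$VM_MAC"

    echo "--- neighbour cache entry for the VM IP ---"
    ip neigh show | grep "$VM_IP"

    echo "--- reachability from this host ---"
    ping -c 2 -W 1 "$VM_IP"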

Fix

Steps to clear ARP tables

  • On the Linux hosts:

    ip neigh flush all
    bridge fdb flush dev vmbr0   # needs iproute2 with fdb flush support; on older versions delete entries individually

    Or target a single IP or MAC:

    ip neigh del 192.0.2.10 dev vmbr0
    bridge fdb del aa:bb:cc:dd:ee:ff dev <bridge-port> master   # <bridge-port> is where the entry was learned, e.g. the uplink NIC or the VM's tap device

    Expected result: ip neigh shows a new entry after traffic; bridge fdb shows MAC on the correct host.

  • On common switches:

    • Cisco:

    clear mac address-table dynamic address aabb.ccdd.eeff
    clear arp-cache

    • Other vendors use similar commands. If you cannot clear per-MAC, clear the dynamic table or reboot the switch port.

Forcing ARP updates from the VM or host

  • From the VM (preferred, since it advertises the VM’s own MAC) or from its host, send a gratuitous ARP:

    arping -c 3 -A -I eth0 192.0.2.10

    Or use:

    ip neigh replace 192.0.2.10 lladdr aa:bb:cc:dd:ee:ff dev eth0 nud reachable

    Expected: the gateway and switch relearn the new MAC quickly, and traffic is usually restored immediately once the gratuitous ARP is accepted. If you cannot log into the guest, see the sketch below for triggering it via the guest agent.
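
If the VM runs the QEMU guest agent, the gratuitous ARP can also be triggered from the Proxmox host without logging into the guest. A sketch, assuming the agent and arping are installed inside VM 100 and its interface is eth0:

    # Run arping inside the guest via the QEMU guest agent
    qm guest exec 100 -- arping -c 3 -A -I eth0 192.0.2.10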

Adjusting switch settings

  • Reduce MAC aging or ARP cache timers on the switch to speed relearning. Typical commands vary by vendor. Example:

    • On Cisco switches, lower the MAC address-table aging time (older IOS uses the hyphenated mac-address-table form):

    mac address-table aging-time 120

    • Match timers to how often you migrate VMs. If live migration is frequent, use a lower timer.
  • On unmanaged or cheap switches, replace with a managed unit if the problem recurs. Cheap switches can have buggy MAC learning.

Testing VM connectivity

  • After clearing entries and sending gratuitous ARP, test:

    ping -c 3 192.0.2.10
    arp -n | grep 192.0.2.10
    ip neigh show 192.0.2.10
    bridge fdb show | grep aa:bb:cc:dd:ee:ff

    Expected: ping success, ARP maps to correct MAC, bridge fdb lists MAC on destination host.

Check it’s fixed

Confirming VM accessibility post-fix

  • Run a sequence of checks after a migration (a scripted remote check follows the list):
    1. Confirm the VM is running on the destination host: qm status <vmid>
    2. Confirm VM network interface MAC on that host: bridge fdb show | grep aa:bb:cc:dd:ee:ff
    3. From a remote node, ping and perform a TCP connect: nc -vz 192.0.2.10 22
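
As a sketch, the remote-side checks can be scripted from any machine outside the destination host (192.0.2.10 and port 22 are the example values used above):

    #!/bin/sh
    # Post-migration verification from a remote machine
    VM_IP=192.0.2.10

    ping -c 3 "$VM_IP" || echo "ICMP still failing"
    nc -vz -w 3 "$VM_IP" 22 || echo "TCP port 22 still unreachable"
    ip neigh show | grep "$VM_IP"   # confirm this machine's ARP entry is fresh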

Monitoring ongoing performance

  • Watch for repeat occurrences. If downtime repeats on every Proxmox VM migration, the switch is the likely root cause.

  • Automate a small post-migration script on the destination host to send gratuitous ARP and flush local caches:

    #!/bin/sh
    # Post-migration network refresh on the destination host.
    # VM_IP must be set by the caller; vmbr0 is the default Proxmox bridge name.
    : "${VM_IP:?set VM_IP to the migrated VM's IP address}"

    ip neigh flush all
    bridge fdb flush dev vmbr0   # needs iproute2 with fdb flush support; skip on older hosts
    arping -c 3 -A -I vmbr0 "$VM_IP"

    Run this as a hook after migration; an invocation example follows.
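
For example, invoked by hand after migrating the VM at 192.0.2.10 (the script name post-migration-arp.sh is only an illustration):

    VM_IP=192.0.2.10 sh post-migration-arp.sh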

Documenting troubleshooting steps

  • Log the exact commands run and their outputs. Keep the switch MAC and ARP dumps with timestamps; they show where traffic was actually being forwarded during the event.
  • Note the switch model and firmware. If a particular switch model fails to relearn MACs, record that for replacement planning.

Root cause and remediation summary

  • Root cause is usually stale MAC or ARP entries on the physical switch or upstream router after Proxmox VM migration.
  • Remediation: clear MAC/ARP entries, send gratuitous ARP from the VM or host, or reduce aging timers. Replace unmanaged switches that do not relearn quickly.

If the VM still fails after these checks, probe further: capture traffic on the old and new switch ports, confirm VLAN configuration, and check for asymmetric routing. The steps above fix the common case where Proxmox VM migration succeeds but the physical network has not updated its MAC/ARP state.
