Dear Community,
I’ve installed a CasaOS VM on a Proxmox (v9.x) lab server. This VM replaces a previous OpenMediaVault VM that crashed and couldn’t be recovered (physical disk issue due to a storm). The OMV VM had run for about 18 months without any issue.
Now the CasaOS VM has been running for less than a week, and the server disconnects from the Proxmox cluster (this is how I detect the disconnection: SRV1 notifies me that SRV2 has been lost).
Metrics give: CPU usage = 5%, RAM < 25%, temp is normal…
I don’t understand why it has shown this weird behavior since I installed CasaOS, and I am keen to investigate. The system logs don’t show anything relevant apart from some warnings (“pcieport 0000:00:1b.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)”). Which log file should I look into?
Thanks in advance for any help!
I would not jump straight to blaming CasaOS itself yet, especially if the Proxmox node is actually dropping from the cluster rather than just the VM becoming unreachable.
The important clue here is this:
pcieport ... PCIe Bus Error: severity=Correctable
“Correctable” PCIe errors usually mean the hardware/link recovered, but repeated PCIe bus errors can still point to underlying instability:
- failing NVMe/SATA device
- bad PCIe lane/link training
- power issue after the storm
- flaky RAM
- motherboard/slot instability
- NIC issues
- ASPM/power-management quirks
Since the previous OMV VM died from a physical disk issue after the storm, I would honestly suspect lingering hardware instability before blaming CasaOS.
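Of the items above, the ASPM one is quick to inspect on the Proxmox host. The kernel lists the available ASPM policies in /sys/module/pcie_aspm/parameters/policy and marks the active one with square brackets. A minimal sketch for pulling that out (the function name active_aspm_policy is just an illustration, not a Proxmox tool):

```shell
#!/bin/sh
# Hypothetical helper: print the active PCIe ASPM policy.
# Expects the contents of /sys/module/pcie_aspm/parameters/policy on stdin,
# e.g. "default performance [powersave] powersupersave".
active_aspm_policy() {
    # The active policy is the single word wrapped in square brackets.
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}
```

On the host it would be called as `active_aspm_policy < /sys/module/pcie_aspm/parameters/policy`. If a power-saving policy is active and the correctable errors keep appearing, booting once with the `pcie_aspm=off` kernel parameter is a common way to test whether ASPM is involved. That is only a test, not a confirmed fix for this machine.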
A few things I would check first:
- Confirm whether the entire Proxmox node is briefly freezing/disconnecting, or only the CasaOS VM.
- Check the Proxmox host logs directly, not only inside the VM:
journalctl -b -1 -p err
dmesg -T | grep -iE "pcie|aer|nvme|reset|link down|I/O error"
pveproxy status
- Verify storage health on the Proxmox host:
smartctl -a /dev/nvme0n1
or for SATA:
smartctl -a /dev/sdX
- Watch for cluster/network drops:
journalctl -u corosync
- If you are passing through disks/controllers to the VM, temporarily remove passthrough and test with a normal virtual disk only.
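To see how often the correctable errors occur and which device reports them, the kernel log can be summarized per PCI address. This is a rough sketch that assumes the log lines look like the one quoted in the question. The function name count_pcie_errors is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: count correctable PCIe bus errors per PCI address.
# Reads kernel log text on stdin and prints "<address> <count>" lines.
count_pcie_errors() {
    # Keep only the "<domain:bus:dev.fn>: PCIe Bus Error: severity=Correctable" part.
    grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]: PCIe Bus Error: severity=Correctable' \
        | awk '{ sub(/:$/, "", $1); n[$1]++ } END { for (d in n) print d, n[d] }' \
        | LC_ALL=C sort
}
```

For the boot before the last disconnect, this could be fed with `journalctl -k -b -1 --no-pager | count_pcie_errors`. The reported address (0000:00:1b.0 in the question) can then be looked up with `lspci -s 0000:00:1b.0` to see which port or device sits behind it.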
Also worth noting:
- CasaOS itself is fairly lightweight.
- A normal CasaOS VM should not destabilize a healthy Proxmox node.
- Cluster disconnects are often caused by networking problems, storage stalls, or kernel/hardware faults underneath.
The PCIe errors are the part I would focus on first.
Thanks a lot for this detailed (and quick) answer!
I’m leaving for a business trip this afternoon, so I won’t have time to follow your advice and dig deeper into the system, but I will get back in touch next week with the outcome!
Thanks again!