I’m coming from OpenMediaVault, which ran perfectly fine, but as soon as I learned about ZimaOS I jumped on. That was about a month ago. I’m running it on an MSI mini PC with a Pentium 6405U CPU and 8GB of RAM. The RAM tested out perfectly fine with Memtest.
Every 3 or 4 days, ZimaOS crashes completely. No network at all, not even ping reply, no services running, dockers all down. Even the console doesn’t respond and when I connect a monitor, no signal on that either.
lspci -v reveals that I’m using the i915 driver:
00:02.0 VGA compatible controller: Intel Corporation Comet Lake-U GT2 [UHD Graphics 620] (rev 02) (prog-if 00 [VGA controller])
DeviceName: Onboard - Video
Subsystem: Micro-Star International Co., Ltd. [MSI] Device b183
Flags: bus master, fast devsel, latency 0, IRQ 133, IOMMU group 0
Memory at b0000000 (64-bit, non-prefetchable) [size=16M]
Memory at a0000000 (64-bit, prefetchable) [size=256M]
I/O ports at 3000 [size=64]
Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Root Complex Integrated Endpoint, IntMsgNum 0
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [d0] Power Management version 2
Capabilities: [100] Process Address Space ID (PASID)
Capabilities: [200] Address Translation Service (ATS)
Capabilities: [300] Page Request Interface (PRI)
Kernel driver in use: i915
Kernel modules: i915
I’ve read a post saying that i915 can run into issues due to power saving, so I’ve made some additions to cmdline.txt. The full file now reads:
A full freeze like that usually points more toward a low-level kernel, GPU, storage, firmware, or power-management issue rather than Docker itself.
The fact you lose:
network
console
video output
makes it sound like the whole kernel is locking up.
The i915 tweaks you added were definitely worth trying, especially the PSR/DC ones, but if it’s still happening every few days I’d probably look next at:
BIOS update
disabling deep C-states / ASPM in BIOS temporarily
checking SSD firmware/model
whether hardware transcoding is enabled in Jellyfin/Plex
Also worth mentioning which exact ZimaOS version you’re on, because 1.6.x has introduced a few stability regressions for some users lately.
The fact OMV was stable on the same hardware is actually useful information too, because it points more toward a kernel/driver interaction than outright bad hardware.
Ah, forgot to mention, I’m running 1.6.1 (plus). No BIOS updates, but I could check the power-saving settings in the BIOS. It has sparse settings though. At the same time, it wouldn’t take the machine 3 days to enter a certain power state, would it? I’d expect it to crash all the time if certain C-states were problematic. I don’t run Plex or Jellyfin btw; the only thing that uses the GPU is Frigate, which uses VAAPI on this iGPU.
Interesting that Frigate is using VAAPI on the iGPU. Honestly, that makes i915 a much stronger suspect again.
And yes, normally problematic C-states or power saving issues often show up under idle/light-load conditions after long uptime, not necessarily immediately. The system can hit a certain sleep/idle transition hours or days later depending on workload patterns.
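Even without BIOS options, the idle states the kernel actually enters can be read from sysfs. A minimal sketch, assuming the standard cpuidle sysfs layout; the function name list_cstates is made up for illustration, and the optional root argument exists only so it can be tried against a copy of the tree:

```shell
# list_cstates: prints "<state name> <times entered>" for each idle state of cpu0.
# Argument: sysfs root, defaults to /sys (hypothetical helper, not a system tool).
list_cstates() {
    root=${1:-/sys}
    for d in "$root"/devices/system/cpu/cpu0/cpuidle/state*; do
        # Skip silently when cpuidle is not exposed on this kernel.
        [ -r "$d/name" ] || continue
        printf '%s %s\n' "$(cat "$d/name")" "$(cat "$d/usage")"
    done
}

# Usage on the running system:
#   list_cstates
```

If the deep states show heavy usage, the intel_idle.max_cstate=1 kernel parameter is one way to cap them for a test run, though whether that helps on this machine is only a guess.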
Since you’re on 1.6.1 as well, I’d probably test one thing at a time:
temporarily disable Frigate hardware acceleration and see if stability improves
check BIOS for C-state/ASPM options
check for a newer BIOS from MSI
Because a full hard lock with no console response feels very kernel/driver level, and VAAPI+i915 has definitely caused hangs on some Intel systems before.
Coincidentally it crashed again tonight, just a few hours after the last crash. It hadn’t done that before. No BIOS updates are to be expected anymore; this is an 8th gen ‘NUC’ from MSI, who have abandoned it by now. What confuses me is that under OMV I used Frigate as well, with the exact same config as under ZimaOS, so also with VAAPI, and that ran fine. Am I right in suspecting something wrong in the Zima/CasaOS distribution rather than the actual Intel GPU? Could I supply certain logs or dumps or something to investigate?
Will go through the bios settings today and report. If that fails I’ll revert vaapi to cpu.
[edit] Nope, no BIOS updates (latest is from 2023) and no power / C-state settings in there whatsoever. I’ve now disabled iGPU acceleration in Frigate. I thought it would have been enough to remove the /dev/dri/renderD128 device from being mounted in the Docker container, but to my surprise Frigate was still able to use the GPU without it. So I disabled it in the Frigate config, but I thought mapping the /dev/dri/renderD128 device into the container was mandatory?
Honestly, the fact OMV was stable with the exact same Frigate VAAPI setup makes your suspicion pretty reasonable. That points more toward a kernel/driver/regression difference in the ZimaOS stack than outright faulty Intel hardware.
And yes, normally /dev/dri/renderD128 needs to be mapped for hardware acceleration inside the container. If Frigate still appeared to use the GPU after removing it, I’d double-check:
whether the container was fully recreated
whether another /dev/dri device is still mounted
whether Frigate was actually falling back to CPU but reporting VAAPI initialized earlier in logs
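A rough way to check the first two points, assuming the container is named frigate (an assumption; use whatever name docker ps shows). docker inspect prints the device mappings and the privileged flag the container was created with. A privileged container can see every host device, which would explain GPU access without an explicit /dev/dri mapping. The helper name has_dri is made up for illustration:

```shell
# has_dri: reads the JSON from `docker inspect` (HostConfig.Devices) on stdin
# and succeeds if any mapped host device lives under /dev/dri.
# (hypothetical helper name, not part of Docker)
has_dri() {
    grep -Eq '"PathOnHost": ?"/dev/dri'
}

# Usage on the host (container name "frigate" is an assumption):
#   docker inspect --format '{{json .HostConfig.Devices}}' frigate | has_dri && echo "DRI device mapped"
#   docker inspect --format '{{.HostConfig.Privileged}}' frigate   # "true" = all host devices visible
#   docker exec frigate ls -l /dev/dri                             # what the container actually sees
```

Note that this only covers devices passed with a device mapping; a /dev/dri directory passed in as a volume would show up under the container's mounts instead.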
I think disabling VAAPI entirely for a few days is the right next test. If the crashes suddenly stop, that becomes a very strong clue.
As for logs, unfortunately with a full hard lock they’re often lost unless persistent journaling or netconsole is enabled beforehand. But after reboot I’d still grab:
Unfortunately, even with the GPU disabled in Frigate, it just crashed again. I could try stopping Frigate for a few days, but it could be any of the Docker containers (I have just three: Unifi, Duplicati and Frigate). Apart from that it’s mainly used as an SMB share. It baffles me that OMV ran perfectly fine for months, if not a year. Of course it rebooted for updates every now and then, but it never crashed.
Should I try a USB NIC? Not a fan of that, but who knows?
If the whole machine is freezing solid, including the local console and video output, I don’t think I’d jump straight to a USB NIC. A network issue would normally kill connectivity, but the system itself should still respond locally.
The fact that disabling Frigate GPU acceleration made no difference is interesting too. At this point I’d be more inclined to suspect either a kernel/driver issue in ZimaOS 1.6.1 or some hardware compatibility issue that OMV’s kernel happened to handle better.
Personally, before buying any hardware, I’d try stopping Frigate completely for a few days. Not because I think Frigate is necessarily the cause, but because it’s one of only three containers and it’s easy to rule out.
Also, if you haven’t already, it may be worth opening a GitHub issue or support ticket with IceWhale and including:
hardware model
ZimaOS version
your kernel parameters
the fact OMV ran on the same hardware for months without a single crash
That comparison is probably the most valuable clue in the whole thread.
It turns out a ZimaOS update overwrote cmdline.txt, erasing the additions I made there. This week I got an update notification for 1.6.1, which was already installed. I installed it anyway, and I guess that reverted cmdline.txt. Should it do that in the first place? For some users there might be specific settings in there without which the machine won’t boot properly at all.
Anyway I put the powersaving things back in, to test with Frigate without GPU first properly before changing anything else. One thing at a time to better understand what’s messing up here.
Personally, I would not expect custom additions in cmdline.txt to survive every OS update, as boot files are often replaced as part of the update process. That said, it would be nice if ZimaOS either preserved user-added parameters or at least warned that the file will be reset.
The good news is that this means your earlier testing didn’t really count, because the system may have been running for days without those power-management settings actually being applied. So they haven’t been ruled out yet.
I think you’re taking the right approach now: put the parameters back, keep Frigate on CPU only, and change just one variable at a time. If the crashes stop, you’ll have a much clearer idea of which change actually made the difference.
Please keep us updated. Finding that cmdline.txt was reverted may turn out to be the key clue here.
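To catch a silent revert after the next update, the running kernel's command line can be compared against the parameters you expect. A minimal sketch: the helper name missing_params is made up, and the two i915 parameters in the usage lines are only my guess at the PSR/DC additions, so substitute whatever you actually put in cmdline.txt:

```shell
# missing_params: prints every wanted parameter that is absent from the
# given kernel command line. Arguments: CMDLINE PARAM...
# (hypothetical helper; the parameter names below are assumptions)
missing_params() {
    cmdline=" $1 "
    shift
    for p in "$@"; do
        case "$cmdline" in
            *" $p "*) ;;                # present as a whole word
            *) printf '%s\n' "$p" ;;    # not on the command line
        esac
    done
}

# Usage on the running system (no output means everything is applied):
#   missing_params "$(cat /proc/cmdline)" i915.enable_psr=0 i915.enable_dc=0
#   cat /sys/module/i915/parameters/enable_psr   # value the driver is really using
```

Running this once after every update or reboot shows whether a test period actually had the settings active.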
That’s frustrating, but at least this gives us a clearer direction.
If it still crashed with the cmdline changes back in place and Frigate GPU acceleration disabled, then I’d probably stop focusing only on the GPU side for now.
The next thing I would do is fully stop Frigate for a few days, not just disable GPU acceleration. That way you can properly rule it in or out. Leave the system running with just ZimaOS, SMB, Unifi and Duplicati, and see if it still freezes.
If it still crashes with Frigate completely stopped, then it starts looking much more like a ZimaOS/kernel/hardware compatibility issue rather than Frigate itself.
The OMV comparison is still the big clue for me. If the same hardware ran OMV for months without freezing, but ZimaOS locks up every few days, then something in the ZimaOS side is not playing nicely with this hardware.
I’d avoid buying a USB NIC for now. If the local console and video output are also frozen, that doesn’t sound like a simple network issue. It sounds more like the whole system is locking hard.
I don’t have to buy hardware, I have piles of stock😉 but as an update, just about 5 hours after the previous crash, it just crashed again. So it’s not even days but hours this time.
I’ll shut Frigate for a while. But I think it’s more an OS / kernel, in the end Frigate is just a docker running. It shouldn’t be able to halt the whole system.
But indeed, OMV ran perfectly fine. I think I still have that SSD with it on if I could check some things. I have a 2.5’’ bay where I can easily swap disks. I could check (even for my own curiosity) what drivers and modules are loaded in OMV.
A Docker container should not normally be able to freeze the entire machine solid, especially if the local console and video output also stop responding. It can cause high load, memory pressure, or driver problems, but a full hard lock usually points deeper than the container itself.
Stopping Frigate completely is still worth doing, just to remove it from the equation. But at this stage I would also be leaning more toward OS/kernel/driver compatibility, especially because OMV ran on the same hardware without this behaviour.
If you still have the OMV SSD, that is actually a very good comparison test. Booting back into OMV and checking what kernel, drivers and loaded modules it uses could give a useful clue, especially around network, chipset, storage, and GPU related drivers.
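The module comparison could be done roughly like this. It is only a sketch: the helper names and the file names are placeholders, and it assumes you save one list per OS to a disk or share both systems can reach:

```shell
# module_names: reads `lsmod` output on stdin and prints the sorted module
# names (first column), skipping the header line.
module_names() {
    awk 'NR > 1 { print $1 }' | LC_ALL=C sort
}

# only_in_first: prints names present in file A but not in file B.
# Both files must be sorted the way module_names sorts them.
only_in_first() {
    LC_ALL=C comm -23 "$1" "$2"
}

# Usage (file names are placeholders):
#   on OMV:     uname -r; lsmod | module_names > omv-modules.txt
#   on ZimaOS:  uname -r; lsmod | module_names > zima-modules.txt
#   only_in_first omv-modules.txt zima-modules.txt   # loaded on OMV only
#   only_in_first zima-modules.txt omv-modules.txt   # loaded on ZimaOS only
```

The uname -r output is worth noting alongside, since a different kernel version can matter as much as a different module list.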
The fact it has now gone from every few days to only a few hours also makes it harder to ignore. Something is clearly not stable on this setup under ZimaOS.
Yes, you can check the ZimaOS logs from the previous boot. That is usually the best place to start after a random reboot.
After the system comes back online, SSH into ZimaOS and run:
journalctl -b -1 -e
Then check the previous boot kernel logs:
journalctl -b -1 -k
Also check the current boot kernel messages:
dmesg -T | tail -200
If you can, please copy/paste the output here, especially anything mentioning errors, warnings, reboot, panic, watchdog, thermal, power, disk, I/O, or kernel messages.
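If the full output is too long to post, a simple keyword filter can narrow it down first. A minimal sketch using the keywords above; the function name suspicious_lines is made up, and the pattern is deliberately broad, so expect some harmless matches:

```shell
# suspicious_lines: reads log text on stdin and prints only the lines that
# mention one of the keywords, case-insensitively.
suspicious_lines() {
    grep -iE 'error|warn|reboot|panic|watchdog|thermal|power|disk|i/o'
}

# Usage:
#   journalctl -b -1 -k --no-pager | suspicious_lines
#   dmesg -T | suspicious_lines
```

The unfiltered log is still the better thing to attach to a support ticket; the filter is only for a quick first look.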
Also include these details so IceWhale can compare cases:
ZimaOS version:
Hardware model:
Does it reboot by itself or freeze until you power cycle it:
Docker apps running:
Did this start after updating:
If the reboot is very sudden, the logs may not show everything, but this is the right first step so IceWhale can see whether it looks like an OS/kernel issue, hardware issue, or something triggered by a container.
For what it’s worth, here are the logs from my system as of now. It crashed yesterday around 17:00 CET (and before that at around 12:00 CET). I don’t see any obvious errors in the logs, but I didn’t go through them very thoroughly. As the system hangs completely, without even the local console or video output working, I can imagine it just stalls and doesn’t log anything at all. But we’ll see.
I had a quick look and I don’t see an obvious smoking gun like a kernel panic, OOM kill, disk I/O error, or thermal shutdown just before the crash.
What does stand out is that the logs appear to simply stop before the crash, then the next log starts with a fresh kernel boot. That lines up with what you described: the machine is locking hard enough that it does not get a chance to write anything useful to the logs.
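Since the journal just stops, one way to pin down the exact freeze time is a small heartbeat that writes a timestamp to disk at a fixed interval. A minimal sketch; the function name heartbeat and the log path in the usage lines are my own choices, not anything ZimaOS provides:

```shell
# heartbeat: appends a timestamp to FILE every INTERVAL seconds, COUNT times.
# COUNT 0 means run until killed. sync flushes each line to disk so the last
# entry has a chance of surviving a hard freeze.
heartbeat() {
    file=$1
    interval=$2
    count=$3
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        date '+%Y-%m-%d %H:%M:%S' >> "$file"
        sync
        sleep "$interval"
        i=$((i + 1))
    done
}

# Usage (run from a session that stays open):
#   heartbeat /DATA/heartbeat.log 60 0 &
# After the next freeze and reboot:
#   tail -n 1 /DATA/heartbeat.log   # last moment the system was still alive
```

Comparing that last timestamp with the final journal entry shows whether logging stopped at the freeze itself or some time before it.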
I also noticed your boot command line does include the power-management changes, so at least those were active during this test.
There are some i915/Intel graphics related entries during boot, including the i915 parameters and kernel taint messages, but I would not say that proves the GPU/display driver is the cause. It is just something worth IceWhale looking at, especially because you lose local console/video output as well.
At this stage I think your Frigate-off test is still the right next step. If it still locks with Frigate completely stopped, then this looks much more like a ZimaOS/kernel/driver compatibility issue on this hardware rather than an application issue.
I’m having similar issues. I’ve managed to get it down to roughly once every 27 hours, give or take. I had a thread here, but it’s still happening somewhat regularly. In the latest tests it seemed like my unused NVMe drive (and/or the Blu-ray drive) was causing the lockups because it was stalling, but since disabling both it has still crashed, so I’m not sure what’s actually happening. It didn’t use to happen months ago; it was like a switch flipped. I might have to seriously look at switching to another OS, since it’s such a pain dealing with it and seemingly no amount of troubleshooting will fix it.
Hi, this boot.log does not show any records of a system crash or of inaccessible logs. If the system crashes again, please run the following command after the next boot to collect the previous boot’s system logs, and send the generated file to us or to my email (dina@icewhale.org):
sudo journalctl -b -1 -n 500 > /DATA/logs.txt
In addition, you can run the following command:
ln -s /var/log/journal/ /DATA/journal
After that, open the journal folder in Files and send us all the logs inside. These logs contain records from previous system shutdowns. Once the logs have been sent, you may delete the journal folder from Files.