[ZimaBoard 2] NVMe controller errors with Crucial P3 SSDs on dual NVMe adapter - TrueNAS Scale 25.10.2

Hello everyone,

I’m reaching out to the community for help with a persistent issue I’m facing on my ZimaBoard 2. I’ve done quite a bit of troubleshooting already, but the problem still appears occasionally, and I’m hoping someone here might have experience with a similar setup.

Hardware Setup:

  • Board: ZimaBoard 2 (16 GB RAM)
  • Storage Adapter: Zima PCIe 3.0 x4 to Dual NVMe M.2 adapter card
  • SSDs: 2x Crucial P3 1TB NVMe SSDs
  • OS Drive: 128GB SATA SSD (for TrueNAS)
  • Additional Disk: Crucial BX500 1TB SATA SSD (for local snapshots/backups)

Software:

  • OS: TrueNAS Scale 25.10.2 (fresh install)
  • Kernel: 6.12.33-production+truenas

The Problem:

I occasionally see the following errors in the console or logs:

nvme nvme0: controller is down: will reset CSTS=0x3, PCI_STATUS=0x10
nvme nvme0: resetting controller due to persistent internal error

After the error, the system usually recovers (the controller resets), but it’s clearly a sign of instability. The chip on the NVMe adapter gets very hot to the touch (can’t keep a finger on the heatsink), though the SSDs themselves remain cool.
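To put a number on "occasionally," I find it helps to count these events over time rather than eyeball the console. A minimal sketch (assuming `journalctl` access on TrueNAS Scale; adjust the match strings if your log wording differs):

```shell
# Hedged sketch: count NVMe controller-down / reset events in kernel-log text.
# On the box itself you would feed it the journal, e.g.:
#   journalctl -k --since "-7 days" | nvme_reset_count
nvme_reset_count() {
  grep -cE 'nvme nvme[0-9]+: (controller is down|resetting controller)'
}
```

Running this daily (or per boot) makes it easy to tell whether a change like active cooling actually reduced the error rate.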

What I’ve Tried So Far:

  1. Sysctl/kernel parameters:

    • Added pcie_aspm=off (confirmed active in /proc/cmdline)

    • Tried adding nvme_core.default_ps_max_latency_us=0 and pcie_port_pm=off via the TrueNAS web UI (Sysctl with UDEV type), but later discovered these are not proper sysctl variables.

    • Then applied them correctly as kernel_extra_options via the midclt command:

      midclt call system.advanced.update '{"kernel_extra_options": "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"}'
      

    After a reboot, I verified that all three parameters are now present in /proc/cmdline and that cat /sys/module/nvme_core/parameters/default_ps_max_latency_us returns 0.

    Result: The error frequency has decreased significantly, but it hasn’t disappeared completely.
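For anyone repeating these steps, the verification can be scripted so you don't have to read /proc/cmdline by eye after each reboot. A small sketch (the parameter list matches the midclt call above; `check_params` is just a helper name I made up):

```shell
# Sketch: confirm each kernel parameter actually landed in the command line.
# Pass in the contents of /proc/cmdline, e.g.:
#   check_params "$(cat /proc/cmdline)"
check_params() {
  for p in pcie_aspm=off pcie_port_pm=off nvme_core.default_ps_max_latency_us=0; do
    case " $1 " in
      *" $p "*) echo "OK      $p" ;;
      *)        echo "MISSING $p" ;;
    esac
  done
}
```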

What I Haven’t Tried Yet:

  • Firmware update: I haven’t updated the Crucial P3 firmware. I plan to do it, but I need to find a Windows machine for that.
  • Active cooling: Adding a small fan pointing at the adapter.
  • Testing with a single NVMe drive installed, to isolate a potential power delivery issue.
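Before hunting for a Windows machine, it's worth recording the current firmware revision of each drive so you can compare before/after. A hedged sketch using nvme-cli (package availability may vary on TrueNAS Scale; `fw_rev` is a helper name of my own):

```shell
# Hedged sketch: pull the firmware revision ("fr") field out of `nvme id-ctrl`
# output, so drives can be compared before and after a firmware update.
# Usage on real hardware:  nvme id-ctrl /dev/nvme0 | fw_rev
fw_rev() {
  awk -F: '/^fr[[:space:]]/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}
```

`smartctl -a /dev/nvme0` reports the same value on its "Firmware Version" line if nvme-cli isn't installed.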

My Questions:

  1. Has anyone here experienced similar "controller is down" NVMe errors with a ZimaBoard + dual NVMe adapter + Crucial P3 combo?
  2. Could the adapter chip overheating be a normal behavior, or is it a red flag? Should I prioritize active cooling?
  3. Does anyone know if there’s a known firmware issue with Crucial P3 drives that might cause this? (I’ll update it anyway, but curious if others have seen improvements after updating.)
  4. Could this be a power delivery limitation of the ZimaBoard’s PCIe slot? The adapter draws power from the slot, and two NVMe drives under simultaneous load might exceed what it can reliably supply.
  5. Is there any other kernel parameter or BIOS setting I should try before concluding it’s a hardware issue?

Additional Info:

  • The SSDs are brand new and pass long SMART self-tests (smartctl -t long) without errors.
  • I’m aiming for a mirror pool with these two drives for data redundancy.

Any insights, experiences, or suggestions would be greatly appreciated. Thanks in advance for your help!