I was having periodic problems with my Zimacube Pro running TrueNAS Scale. The box would become non-responsive every 24-40 hours, during which time a monitor plugged into it would show scrambled text, the keyboard was non-responsive, and there was no response to ping or SSH. I removed the top and this seemed to improve things a bit, and then I got the cooler upgrade and installed that. Now the box is relatively stable, but it still hangs about once a week. If I reattach the case top, the unit goes offline within a day, and when I remove the top, the CPU cooler and memory are hot enough that I can’t touch them.
I have six Seagate 11 TB spinning drives and one 2 TB NVMe. This is the Pro box with 64 GB of RAM.
If anyone else is having overheating problems when running TrueNAS Scale, I’d love to hear about them and/or suggestions for fixing this. I notified Icewhale about the overheating and purchased the upgraded CPU cooler, but haven’t notified them of the overheating after the upgrade.
Based on the scrambled text you mentioned, I have two suspicions. One is that the memory is damaged, and the other is that the system SSD is overheating. IceWhale has released an official test document, and I think you can try running a test based on it. In addition, what is the device temperature after installing the official upgraded cooler? When I am using mine, the temperature is usually stable between 50 and 60 ℃.
This is the CPU temp with the upgraded cooler but with the top off of the Cube. If I put the top on, the temp rises a lot (+20C) and the system crashes in 12-24 hours. I hadn’t thought about the boot SSD. I am not running the stock SSD, as I was having issues with ZimaOS but wanted to keep that SSD intact in case I wanted to return to using it. So I purchased a new Kingston NV2 500G M.2 2280 NVMe and installed that to run TrueNAS. I haven’t had any problems with Kingston NVMe sticks before, but abnormally high temps could cause an issue.
I did some more checking. The average temp on the boot SSD is 45C with the top off and 50C with the top on (< 30 min). Also, after I put the top on, the CPU temp went from 35C to 43C. The chart below shows 24 hours of CPU temp. It seems stable, but it also looks like the lack of a case fan in the upper chamber of the cube is a missed opportunity.
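For anyone who wants to log the same kind of readings from a shell, here is a minimal Python sketch. It assumes a Linux system (TrueNAS Scale is Linux-based) that exposes sensors under `/sys/class/thermal`, where each `thermal_zone*/temp` file holds millidegrees Celsius. Which zones the Zimacube actually exposes is an assumption on my part, and NVMe temperatures are often reported elsewhere (for example under hwmon), so treat this as a starting point only. The function name `read_thermal_zones` is made up for this example.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"zone_name:zone_type": temperature in C} for each readable zone."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            # The kernel reports this value in millidegrees Celsius.
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # zone is missing or unreadable, so skip it
        readings[f"{zone.name}:{kind}"] = millideg / 1000.0
    return readings


if __name__ == "__main__":
    for name, temp_c in read_thermal_zones().items():
        print(f"{name}: {temp_c:.1f} C")
```

Running it once with the top off and once with the top on (after the same warm-up time) would give comparable numbers without relying on the dashboard chart.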
It seems that the machine as a whole is not running at abnormal temperatures. Generally speaking, 50 degrees is not a bad value when the device is working. I still recommend testing the stability of the memory. In addition, because the new cooler uses a silent fan, if you need higher cooling efficiency, you can adjust the temperature PWM setting in the BIOS to 1, increase the fan speed at low temperatures, and change the full-speed temperature to 90 degrees.
I booted with memtest86+ and indeed it finds bad memory addresses almost immediately when I have both DIMMs (64GB). In fact, at times memtest86+ only runs for a few minutes and the cube reboots after showing many low memory errors.
I have tried running a single DIMM (32 GiB) and have not gotten any errors, regardless of the DIMM or the slot it is in. I have tried all combinations of DIMMs and slots. With the top on, the single DIMM (on top) gets very hot, but I don’t get any memory errors.
However, if I run memtest86+ with both DIMMs installed (it doesn’t matter which is on top), I get errors. Here is a screenshot of the test with both DIMMs and the top closed after 75 minutes:
I think there is a heat problem when both DIMMs are operating. The problem is less acute with the top open but with it closed errors are almost immediate. Here is a thermal image of the DIMMs looking at them from the left side of the cube with the CPU heat sink in the background:
This shows hot spots on the visible parts of the top DIMM at 107 deg C and the lower one at 116 C, which I think is warm. I imagine the temperature of the lower DIMM is higher. Note that when I run a single DIMM, the temperature of the top DIMM runs about 100 C in the top slot.
This implies to me that the DIMMs are fine and that the NVMe card is also fine, since I can operate with one DIMM as long as I want. It also implies that either the DIMMs cannot handle the heat produced by their arrangement on the mainboard, or that there is a design flaw in the spacing of the DIMMs (although the arrangement seems similar to other mainboards) or in the airflow available to them.
I’ve been advised against using heat spreaders on the DIMMs, but unless there is a quick fix, I’m going to need to send the cube back. This seems like a design flaw.
Very detailed information. It is normal to have high temperature during the test, but generally, under normal temperature, the actual maximum temperature of the memory is expected to be below 80 degrees. Therefore, we currently believe that the memory is abnormally hot. We recommend that you contact support@icewhale.org for further RMA services.
I fitted a Noctua NF-S12A ULN to the top. I tried a 14 cm fan, but it was no better than the 12 cm fan, which is quieter and uses less power… If I turn off the top fan, the CPU temp quickly rises by 5 degrees.
I have ordered a Thermalright AXP90 X36 Low Profile ITX CPU Cooler. I may not need to use it.
I just ran an unRaid parity build and the MB and CPU stayed cool; the original blower did not get noisy. The cache is a Samsung Pro with heatsink; it would not fit into the NVMe tray, so it is mounted on the MB. Cache_2 is a Samsung NVMe mounted on tray7 and runs cool. The array HDDs all go to around 50 degrees when running parity, at an ambient temp of 22 degrees C.
I fitted 2 Noctua 80mm NF-A8 PWM 2200RPM Fans to the HDD caddy and moved one of the original little noisy fans to the side on a Noctua noise reducing cable. I don’t own a printer so will have to build a cover for the fans.
I have to make a new top and repaint the job when finished. I have stainless steel woven wire mesh coming to cover the square openings.
The rectangular hole is 1400 by 400mm and 1560 sq mm
The 330 holes of 4 mm diameter on the other side have a combined area of about 4147 sq mm.
I asked AI if the 330 holes could handle the 12cm Noctua fan flow.
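The hole-area figure above can be checked by hand. The sketch below reads "4mm" as the hole diameter, which reproduces the 4147 sq mm figure, and compares the combined open area with the full face of a 120 mm circle. Using the whole 120 mm circle for the fan is a simplifying assumption of mine: it ignores the hub and frame, and open area alone says nothing about pressure drop or actual airflow.

```python
import math


def circle_area_mm2(diameter_mm):
    """Area of a circle in square millimetres, given its diameter in mm."""
    return math.pi * (diameter_mm / 2) ** 2


HOLE_COUNT = 330
HOLE_DIAMETER_MM = 4.0    # "4mm" read as the hole diameter
FAN_DIAMETER_MM = 120.0   # assumption: full 120 mm circle, hub and frame ignored

holes_area = HOLE_COUNT * circle_area_mm2(HOLE_DIAMETER_MM)
fan_area = circle_area_mm2(FAN_DIAMETER_MM)
ratio = holes_area / fan_area

print(f"Combined hole area: {holes_area:.0f} sq mm")   # about 4147 sq mm
print(f"120 mm circle area: {fan_area:.0f} sq mm")     # about 11310 sq mm
print(f"Open-area ratio:    {ratio:.0%}")              # about 37%
```

So under these assumptions the perforated side offers a bit over a third of the fan's face area, which is only a rough indication of how restrictive it is.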
Hi Cooltiger. You have certainly done a lot to re-engineer the cube and I think you are on the right track. The top fan is probably the best idea to fix my immediate problem, but the disk fans are also a good idea. My cube is in a location where I don’t care about noise, but I have six disks that do get hotter than I like when writing.
My issue is that after purchasing this, I don’t really think I should need to re-engineer it. This is replacing a homebrew NAS that has worked fine for 7 years, so I expected an upgrade in performance and fewer issues. I have really liked the Icewhale hardware products to date, so this seemed like a good bet. Unfortunately, it’s a mission-critical piece of gear that I haven’t been able to use as such since early August.
I should add that the replacement cooler from IceWhale is much, much better than the cooler that shipped with the Cube. It is quiet and dropped the CPU temp by 3-5C. I may only have a heat-related problem, but it may be something else instead, or in addition. The heat issue aggravates whatever is wrong. The unit will freeze (screen too) even when I first start it up and it’s relatively cool. So it could be an intermittent part. I’ve swapped what I could, but it’s probably time to swap the main board.
Since I last posted, the Icewhale support folks have decided that my memory was faulty and sent me another set of modules. They solve the heat problem, but unlike the original memory modules, these only work in the lower memory slot. If I install both, I can only see 32 GB of RAM. When I put in one at a time, I can see that only the lower slot works. If I put one of the original modules in the upper slot and the new memory in the lower, I get memory errors in upper memory. So still no solution, but some progress.
I have notified Icewhale several times about the outcome, starting two weeks ago, but it’s been radio silence from them since. I am still upset about the problem, but getting more upset about the support.