I recently built a server simply to use as a large (100 TB) iSCSI host: a low-end but still six-core Xeon CPU, 8 GB RAM, and a caching RAID controller driving two arrays – a couple of boot SSDs in RAID-1 and a RAID-6 data array across 16 SATA disks. This is a decent machine, based on a Dell PowerEdge R730xd, and it should have been good enough for the task. It wasn’t.
Once I’d completed the initial setup, during which it seemed fine, it was dreadfully, appallingly slow. Booting took over 10 minutes, logging in another four or five, and opening a command line a couple more. A PowerShell prompt took nearly half an hour to fully appear, and a shutdown took almost as long.
The server had only 8 GB RAM, which has been plenty in the past for a machine that just sits there sharing iSCSI targets to initiators. This time, however, performance doing even the basics was so awful I couldn’t think about putting it into production. I couldn’t even keep a stable RDP session open to see what was going on!
RAM utilisation seemed unduly high – just booted and logged in, the memory commit was over 11 GB and slowly climbing. That was a problem, as the server only had 8 GB of physical RAM in total, so it was already deep into the page file.
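If you want to sanity-check a machine like this without fighting Task Manager over a dying RDP session, a quick way is to query the OS memory figures via CIM. This is a sketch using the standard Win32_OperatingSystem class (the values it reports are in kilobytes, hence the conversion):

```powershell
# Compare physical RAM against the commit limit and what's left of it.
# Win32_OperatingSystem reports these figures in KB, so divide by 1MB for GB.
Get-CimInstance Win32_OperatingSystem |
    Select-Object `
        @{N = 'PhysicalGB';    E = { [math]::Round($_.TotalVisibleMemorySize / 1MB, 1) }},
        @{N = 'FreePhysGB';    E = { [math]::Round($_.FreePhysicalMemory / 1MB, 1) }},
        @{N = 'CommitLimitGB'; E = { [math]::Round($_.TotalVirtualMemorySize / 1MB, 1) }},
        @{N = 'CommitFreeGB';  E = { [math]::Round($_.FreeVirtualMemory / 1MB, 1) }}
```

Run it remotely with `-ComputerName` if the console is too sluggish to use; a commit figure well above physical RAM is the tell-tale for heavy paging.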
The hard drive activity lights were constantly blinking too, and not just on the boot SSDs, which I would expect to be busy with swap activity. The main data drives were all working hard despite there being no apparent reason – the iSCSI service wasn’t even installed at this point, and no reading or writing was happening on the main data volume.
But neither Task Manager nor Resource Monitor was any help in identifying what was using the RAM or accessing the disks. My only clue was in the System log in Event Viewer:
Event ID: 134
The file system was unable to write metadata to the media backing volume D:. A write failed with status “STATUS_SUCCESS” ReFS will take the volume offline. It may be mounted again automatically.
Very strange, especially as it reports a successful write as a failure. A little further down the log, however, was the key to the answer:
Event ID: 135
Volume D: is formatted as ReFS but ReFS is unable to mount it; ReFS encountered status The working set is not big enough to allow the requested pages to be locked..
It turns out the problem was the 100 TB ReFS-formatted drive I had made ready for my iSCSI VHDX files. Previously I’d used multiple smaller NTFS drives in this server. What I didn’t know is that ReFS uses a lot of memory to manage its own metadata, and that appetite grows with the size of the volume.
There were two possible solutions: reformat the huge drive with the less-reliable NTFS, or give the server more resources. I had four 8 GB sticks of the right type of RAM lying around, so I installed them.
Upgrading from 8 GB to 32 GB of RAM proved to be an instant fix. Booting was normal again, and the server was back to behaving at the speed it should.
I created a 50 TB VHDX file for iSCSI; two hours later, with nothing else going on, 23.6 GB of RAM is in use and the disks are idling. That still leaves plenty of RAM to run the iSCSI service, handle the traffic through the 10 Gb NICs, and maybe even cache some disk.
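For anyone repeating this, the VHDX and target can be set up entirely from PowerShell once the iSCSI Target Server role is installed. This is a sketch – the paths, target name, and initiator IQN below are made-up examples, not mine:

```powershell
# Assumes the iSCSI Target Server role is installed (Install-WindowsFeature FS-iSCSITarget-Server).
# Paths, names, and the initiator IQN are illustrative placeholders.

# Create the backing VHDX on the big ReFS volume
New-IscsiVirtualDisk -Path 'D:\iSCSIVirtualDisks\LUN1.vhdx' -SizeBytes 50TB

# Create a target and restrict it to a known initiator
New-IscsiServerTarget -TargetName 'BigStore' `
    -InitiatorIds 'IQN:iqn.1991-05.com.microsoft:client1.example.local'

# Map the virtual disk to the target so initiators can see it
Add-IscsiVirtualDiskTargetMapping -TargetName 'BigStore' -Path 'D:\iSCSIVirtualDisks\LUN1.vhdx'
```

Nothing unusual here – the point is simply that a 50 TB fixed-or-dynamic VHDX on ReFS is created in seconds, while the metadata cost shows up in RAM afterwards.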
There are some ways to manage how ReFS handles its metadata (the reason for the high RAM utilisation) listed in this Microsoft Support Page, which might help further. I didn’t need to use them, and doing so requires a much deeper understanding of the inner workings of ReFS metadata than I need in this instance.
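For reference, the tweaks Microsoft describes are registry values that tune how aggressively ReFS trims its metadata working set. The names below are the ones I recall from Microsoft's Windows Server 2016 guidance – treat this as a sketch and verify them against the support page before touching a production box:

```powershell
# ReFS metadata-trimming tweaks (value names as I recall them from Microsoft's
# Server 2016 guidance -- verify against the linked support page first).
# A reboot is required for these to take effect.
$key = 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem'

# Enable trimming of the ReFS metadata working set
Set-ItemProperty -Path $key -Name RefsEnableLargeWorkingSetTrim -Value 1 -Type DWord

# How many metadata chunks to trim per pass (expected to be a power of two)
Set-ItemProperty -Path $key -Name RefsNumberOfChunksToTrim -Value 8 -Type DWord
```

Given that simply adding RAM fixed everything here, I'd reach for these only where more memory genuinely isn't an option.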
There is a lesson here, though – if you want to use ReFS on a server with little RAM, just don’t. Either install more RAM or rethink using ReFS!