A fast fileserver with FreeBSD, NVMEs, ZFS and NFS
I have a small server running in my flat that serves files locally via NFS and remotely via Nextcloud. This post documents the slightly overpowered upgrade of the hardware and subsequent performance / efficiency optimisations.
TL;DR
- I can fully saturate a 10Gbit LAN connection, achieving more than 1100 MiB/s throughput.
- I can perform a `zpool scrub` at 11 GiB/s, completing a 6.8 TiB scrub in 11 min.
- Idle power usage can be brought down to 34W.
Old setup and requirements
What the server does:
- Serve files via NFS to
- my workstation (high traffic)
- a couple of Laptops (low traffic)
- the TV running Kodi (medium traffic)
- Host a Nextcloud which provides file storage, PIM etc. for a handful of people
Not a lot of compute is necessary, and I have tried to keep power usage low. The old hardware served me well for a really long time:
- AMD 630 CPU
- 16GiB RAM
- 2+1 * 4TiB spinning disk RAIDZ1 with SSD ZIL (“write-cache”)
The main pain point was slow disk access resulting in poor performance when large files were read by the Nextcloud. Browsing through my photo collection via NFS was also very slow, because thumbnail generation needed to pull all the images. Furthermore, low speed meant that I was not doing as much on the remote storage as I would have liked (e.g. storing games), resulting in my workstation’s storage always running out. And I was just reaching the limits of my ZFS pool anyway, so it was time for an upgrade!
New setup
To get better I/O, I thought about switching from HDDs to SATA SSDs, but then realised that SATA SSD performance is very low compared to NVME performance, while the price difference is not that large. Also, NFS+ZFS generates quite a bit of I/O, typically requiring the use of faster caching devices, further complicating the setup. Consequently, I decided to go for a pure NVME setup. Of course, the new server would also need 10GBit networking, so that I can use all that speed in the LAN!
This is the new hardware! I will discuss the details below.
Mainboard, CPU and RAM
The main requirement for the mainboard is to offer connectivity for four NVME disks. And to be prepared for the future, I would actually like 1-2 extra NVME slots. There are two ways to attach NVMEs to a motherboard:
- directly (“natively”)
- via an extension card that is plugged into a PCIexpress slot
Initially, I had assumed no mainboard would offer sufficient native slots, so I did a lot of research on option 2. The summary: it is quite messy. If you want to use a single extension card that hosts multiple NVMEs (which is required in this case), you need so-called “bifurcation support” on the mainboard. This lets you e.g. put a PCIe x8 card with two x4 NVME disks into a PCIe x8 slot on the mainboard. However, this feature is really poorly documented,1 and it varies between mainboard AND CPU whether they support no bifurcation, only x8 → x4x4, or also x16 → x4x4x4x4. The different PCIe versions and speeds, and the difference between the actually supported speed and the electrical interface, add further complications.
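Once the hardware is in place, FreeBSD's `pciconf` can at least confirm what link each NVME actually negotiated; a quick sketch (the grep patterns are just examples):

```shell
# List all PCI devices with their driver names; NVME disks show up as nvmeN
pciconf -l | grep nvme

# Dump capability info; the "PCI-Express" capability line shows the
# negotiated link width (e.g. x4) and speed (e.g. 16.0GT/s) per device
pciconf -lc | grep -A6 nvme
```

If an x4 disk comes up with only an x2 link, the slot is sharing lanes with something else.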
In the end, I decided to not do any experiments and look for a board that natively supports a high number of NVME slots. For some reason, this feature is very rare on AMD mainboards, so I switched to Intel (although I am actually a bit of an AMD fanboy). I probably could have gone with a board that has 5 slots, but I use hardware for a long time and wanted to be safe, so I took a board that has 6 NVME slots (2 free slots):
None of the available boards had a proper2 10GBit network adaptor, so having a usable PCIe slot for a dedicated card was also a requirement. It is important to check whether PCIe slots can still be used when all NVME slots are occupied; sometimes they internally share the bandwidth. But for the above board this is not the case.
Important: To be able to boot FreeBSD on this board, you need to add the following to `/boot/device.hints`:

```
hint.uart.0.disabled="1"
hint.uart.1.disabled="1"
```
For the CPU, I just went with something on the low TDP end of the current Intel CPU range, the Intel Core i3-12100T. Four cores + four threads was exactly what I was looking for, and 35W TDP sounded good. I paired that with some off-the-shelf 32GiB RAM kit.
Case, power supply & cooling
Strictly speaking a 2U case would have been sufficient, but I thought a 3U case might offer better air circulation. I ended up with the Gembird 19CC-3U-01. For unknown reasons, I chose a 2U horizontal CPU fan, instead of a 3U one. The latter would definitely have provided better airflow, but since the fan barely runs at all, it doesn’t make much of a difference.
I was unsuccessful in finding a good PSU that is super efficient in the average case of around 40W power usage but also covers spikes well above 100W, so I just chose the cheapest 300W one I could get :)
The case with everything in place.
The built-in fans are very noisy. I chose to replace one of the intake fans with a spare one I had lying around and to connect only one of the rear exhaust fans. But I added an extra fan where the extension slots are, to divert some airflow around the NIC, which otherwise gets quite warm. This should also blow some air over the NVME heatsinks! All fans can be regulated and fine-tuned from the BIOS of the mainboard, which I totally recommend you do. At the current temperatures and average workloads, the whole setup is almost silent.
Storage
Now, the fun begins. Since I needed more space than before, I clearly wanted a 3+1 x 4TiB RAIDZ1.
My goal was to be able to saturate a 10GBit connection (so get around 1GiB/s throughput) and still have the server be able to serve the Nextcloud without slowing down significantly. Currently the WAN upload is quite slow, but I hope to have fibre in the future. In any case, I thought that any modern NVME should be fast enough, because they all advertise speeds of multiple GiB/s.
Choice of disks
Anyway, I got two Crucial P3 Plus 4TB (which were on sale at Amazon for ~190€), as well as two Lexar NM790 4TB (which were also a lot cheaper than they are now). My assumption that they were very comparable was very wrong:
| Disk | IOPS rand-read | IOPS read | IOPS write | MB/s read | MB/s write | “cat speed” MB/s |
|---|---|---|---|---|---|---|
| Crucial | 53,500 | 794,000 | 455,000 | 2,600 | 4,983 | ~700 |
| Lexar | 53,700 | 796,000 | 456,000 | 4,578 | 5,737 | ~2,700 |
I used this fellow’s fio-script to generate all columns except the last. The last column was generated by simply cat’ing a 10GiB file of random numbers to `/dev/null`, which roughly corresponds to the read portion of copying a 4k movie file.
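For reference, the “cat speed” test needs nothing but `dd` and `cat`; a minimal sketch (file size shrunk here for illustration, the original test used a 10GiB file):

```shell
# Write a test file of random data (64 MiB here; the real test used 10 GiB)
dd if=/dev/urandom of=/tmp/bench.bin bs=1M count=64

# Sequentially read it back, discarding the data; dividing the file size
# by the reported duration gives the "cat speed"
time cat /tmp/bench.bin > /dev/null
```

Note that on a repeat run the file may be served from the ARC/page cache, which is why the original test used a file large enough to make caching irrelevant.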
Since I had two disks each, I actually took the time to test all of them in different mainboard slots, but the results were very consistent: in real-life tasks, the Crucial disks underperformed significantly, while the Lexar disks were super fast.
I decided to return the Crucial disks and get two more by Lexar 😎
Disk encryption
I always store my data encrypted at rest. FreeBSD offers GELI block-level encryption (similar to LUKS on Linux). But OpenZFS has also offered dataset/filesystem-level encryption for a while now. I previously used GELI, but I wanted to switch to ZFS native encryption, because it provides some advantages:
- Flexibility: I can choose later which datasets to encrypt; I can encrypt different datasets with different keys.
- Zero-knowledge backups: I can send incremental backups off-site that are received and fully integrated into the target pool without that server ever getting the decryption keys.
- Forward-compatibility: I can upgrade to better encryption algorithms later.
- Linux-compatibility: I can import the existing pool in a Linux environment for debugging or benchmarking.
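As a sketch of how the first two points look in practice (pool and dataset names are made up):

```shell
# Create an encrypted dataset; the passphrase is prompted for interactively
# and a different key can be used per dataset
zfs create -o encryption=aes-128-gcm -o keyformat=passphrase tank/private

# Zero-knowledge incremental backup: --raw sends the blocks as stored
# on disk, so the receiving host only ever sees ciphertext
zfs send --raw -i tank/private@monday tank/private@tuesday | \
    ssh backuphost zfs recv backup/private
```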
However, I had also heard that ZFS native encryption was slower, so I decided to do some benchmarks:
| Setup | IOPS rand-read | IOPS read | IOPS write | MB/s read | MB/s write | “cat speed” MB/s |
|---|---|---|---|---|---|---|
| no encryption | 54,700 | 809,000 | 453,000 | 4,796 | 5,868 | 2,732 |
| geli-aes-256-xts | 40,000 | 793,000 | 446,000 | 3,332 | 3,334 | 952 |
| zfs-enc-aes-256-gcm | 26,100 | 513,000 | 285,000 | 3,871 | 4,648 | 2,638 |
| zfs-enc-aes-128-gcm | 29,300 | 532,000 | 353,000 | 3,971 | 4,794 | 2,631 |
Interestingly, GELI3 performs much better on the IOPS, but much worse on throughput, especially on our real-life test case. Maybe some smart person knows the reason for this, but I took this benchmark as an assurance that going with native encryption was the right choice.4 One reason for the good performance of the native encryption seems to be that it makes use of the CPU’s avx2 extensions.
At this point, I feel like I do need to warn people about some ZFS encryption related issues that I learned about later. Please read this. I have had no problems to date, but make up your own mind.
RAIDZ1
| recordsize | compr. | encrypt | IOPS rand-read | IOPS read | IOPS write | MB/s read | MB/s write | “cat speed” MB/s |
|---|---|---|---|---|---|---|---|---|
| 128 KiB | off | off | 50,000 | 869,000 | 418,000 | 3,964 | 5,745 | 2,019 |
| 128 KiB | on | off | 49,800 | 877,000 | 458,000 | 3,929 | 4,654 | 1,448 |
| 128 KiB | off | aes128 | 26,300 | 484,000 | 230,000 | 3,589 | 5,331 | 2,142 |
| 128 KiB | on | aes128 | 27,400 | 501,000 | 228,000 | 3,510 | 3,927 | 2,120 |
These are the numbers after creation of the RAIDZ1-based pool. They are quite similar to the numbers measured before. The impact of encryption on IOPS is clearly visible, less so on sequential read/write throughput. Compression seems to impact write throughput but not read throughput, which is expected for `zstd`. It is unclear why “cat speed” is lower here.
| recordsize | compr. | encrypt | IOPS rand-read | IOPS read | IOPS write | MB/s read | MB/s write | “cat speed” MB/s |
|---|---|---|---|---|---|---|---|---|
| 1 MiB | off | off | 7,235 | 730,000 | 404,000 | 3,686 | 3,548 | 2,142 |
| 1 MiB | on | off | 7,112 | 800,000 | 470,000 | 3,624 | 3,447 | 2,064 |
| 1 MiB | off | aes128 | 3,259 | 497,000 | 258,000 | 3,029 | 3,422 | 2,227 |
| 1 MiB | on | aes128 | 3,697 | 506,000 | 249,000 | 3,137 | 3,361 | 2,237 |
Many optimisation guides suggest setting the zfs `recordsize` to 1 MiB for most use-cases, especially storage of media files. But this seems to drastically penalise random-read IOPS while providing little to no benefit in the sequential read/write scenarios. This is actually a bit surprising and I will need to investigate this more. Is it perhaps because NVMEs are good at parallel access and therefore suffer less from fragmentation anyway?
In any case, the main takeaway is that overall read and write throughputs are over 3,000 MiB/s in the synthetic case and over 2,000 MiB/s in the manual case, which is great.
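Since `recordsize` is a per-dataset property, it can be tuned per workload rather than pool-wide; a sketch (dataset names are examples, and the setting only affects newly written blocks):

```shell
# Keep the 128K default for mixed/random workloads
zfs set recordsize=128K tank/home

# Per-dataset override, e.g. for a dataset holding large media files
zfs set recordsize=1M tank/media

# Verify what is in effect
zfs get recordsize tank/home tank/media
```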
Other disk performance metrics
| Operation | Speed [MiB/s] |
|---|---|
| Copying 382 GiB between two datasets (both enc+comp) | 1,564 |
| Copying 505 GiB between two datasets (both enc+comp) | 800 |
| `zpool scrub` of the full pool | 11,000 |
These numbers further illustrate some real world use-cases. It’s interesting to see the difference between the first two, but it’s also important to keep in mind that this is reading and writing at the same time. Maybe some internal caches are exhausted after a while? I didn’t debug these numbers further, but I think the speed is quite good after such a long read/write.
More interesting is the speed for scrubbing, and, yes, I have checked this a couple of times. A scrub of 6.84TiB happens in 10m - 11m, which is pretty amazing, I think, considering that it is reading the data and calculating checksums. I am assuming that sequential read is just very fast and that access to the different disks happens in parallel. The checksum implementation is apparently also avx2 optimised.
LAN
Network adapter
Based on recommendations, I decided to buy an Intel card. Cheaper 10GBit network cards are available from Marvell/Aquantia, but the driver support in FreeBSD is poor, and the performance is supposedly also not close to that of Intel.
Many people suggested I go for SFP+ (fibre) instead of 10GBase-T (copper), but I already have CAT7 cables in my flat. While I could have used fibre purely for connecting the server to the switch (and this would likely save some power), I would have had to buy a new switch and the options were just not economical—I already have a switch with two 10GBase-T ports which I had bought for exactly this setup.
The cheapest Intel 10GBase-T card out there is the X540 which is quite old and available on Amazon for around 80€. I bought two of those (one for the server and one for the workstation). More modern cards are supposedly more energy efficient, but also a lot more expensive.5
NFS Performance
On server and client, I set:

- `kern.ipc.maxsockbuf=4737024` in `/etc/sysctl.conf`
- `mtu 9000 media 10gbase-t` in `/etc/rc.conf` (ifconfig)

Only on the server:

- `nfs_server_maxio="1048576"` in `/etc/rc.conf`

Only on the client:

- `nfsv4,nconnect=8,readahead=8` as the mount options for the NFS mount
- `vfs.maxbcachebuf=1048576` in `/boot/loader.conf` (not sure any more if this makes a difference)

These settings allow larger buffers and increase the amount of readahead. This favours large sequential reads/writes over small random reads/writes.
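Putting the client-side pieces together, the mount can be made permanent via `/etc/fstab` (the server address and paths are examples):

```shell
# /etc/fstab entry on the client:
# 10.0.0.2:/  /mnt/server  nfs  rw,nfsv4,nconnect=8,readahead=8  0  0

# equivalent one-off mount for testing:
mount -t nfs -o nfsv4,nconnect=8,readahead=8 10.0.0.2:/ /mnt/server
```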
The full options on the client end up being:
```
# nfsstat -m
X.Y.Z.W:/ on /mnt/server
nfsv4,minorversion=2,tcp,resvport,nconnect=8,hard,cto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=1048576,wsize=1048576,readdirsize=1048576,readahead=8,wcommitsize=16777216,timeout=120,retrans=2147483647
```
I use NFS4 for my workstation and NFS3 for everyone else. I have performed no benchmarks on NFS3, but I see no reason why it would be slower.
| IOPS rand-read | IOPS read | IOPS write | MB/s read | MB/s write | “cat speed” MB/s |
|---|---|---|---|---|---|
| 283 | 292,000 | 33,200 | 1,156 | 594 | 1,164 |
This benchmark was performed on a dataset with 1M recordsize, encryption, but no compression.
Random-read IOPS are pretty bad, and I see a strong correlation here to the `rsize` (if I halve it, I double the IOPS; not shown in table). It’s possible that every 4KiB read actually triggers a 1MiB read in NFS, which would explain this. On the other hand, the sequential read and write performance is pretty good, with synthetic and real-world read speeds being very close to the theoretical maximum of the 10GBit connection.
One thing to keep in mind: the blocksize when reading has a very strong impact on the performance. This can be seen when using `dd` with different `bs` arguments. Of course, 1MiB is optimal if that is also used by NFS, and `cat` seems to do this. However, `cp` does not, which results in much slower performance than using `dd if=.. of=.. bs=1M`.
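The effect is easy to demonstrate; a sketch (in a real test the file would live on the NFS mount, a local file is used here so the commands run anywhere):

```shell
# Example file; on a real test this would be a large file on the NFS mount
FILE=/tmp/blocksize-demo.bin
dd if=/dev/urandom of="$FILE" bs=1M count=32

# 4 KiB userspace reads: over NFS, each can trigger a full rsize-sized READ
dd if="$FILE" of=/dev/null bs=4k

# 1 MiB reads line up with rsize=1048576 and perform much better
dd if="$FILE" of=/dev/null bs=1M
```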
I have done measurements with plain `nc` over the wire (also reaching 1,160 MiB/s) and `iperf3`, which achieves 1,233 MiB/s, just below the 1,250 MiB/s equivalent of 10GBit.
Power consumption and thermals
For a computer running 24/7 in my flat, power consumption is of course important. I bought a device to measure power consumption at the outlet to get an accurate picture.
idle
Because the computer is idle most of the time, optimising idle power usage is most important.
| Change | Power [W] |
|---|---|
| default | 50 |
| `*_cx_lowest="Cmax"` | 45 |
| disable WiFi and BT | 42 |
| `media 10gbase-t` | 45 |
| `machdep.hwpstate_pkg_ctrl=0` | 41 |
| turn on chassis fans | 42 |
| ASPM modes to L0s+L1 / enabled | 34 |
I assume that the same setup on Linux would be slightly more efficient, but 34W in idle is acceptable.
Clearly, the most impactful changes were:

- Activating ASPM for the PCIe devices in the BIOS.
- Adding `performance_cx_lowest="Cmax"` and `economy_cx_lowest="Cmax"` to `/etc/rc.conf`.
- Adding `machdep.hwpstate_pkg_ctrl=0` to `/boot/loader.conf`.
You can find online resources on what these options do. You might need to update the BIOS to be able to disable WiFi and Bluetooth devices completely. You can also use hints in `/boot/device.hints`, but this doesn’t save as much power.
Using 10GBase-T speed on the network device (instead of 1000Base-T) unfortunately increases power usage notably, but there is nothing I could find to mitigate this.
Things that are often recommended but that did not help me (at least not in idle):
- NVME power states (more on this below)
- lower values for `sysctl dev.hwpstate_intel.*.epp` (more on this below)
- `hw.pci.do_power_nodriver=3`
| idle temperatures | °C |
|---|---|
| CPU | 37-40 |
| NVMEs | 52-55 |
The latter was particularly interesting, because I had heard that newer NVMEs, and especially those by Lexar get very warm. It should be noted though, that the mainboard comes with a large heatsink that covers all NVMEs.
under load
The only “load test” that I performed was a scrub of the pool. Since this puts stress on the NVMEs and also the CPUs, it should be at least indicative of how things are going.
| during `zpool scrub` | °C |
|---|---|
| CPU | 55-59 |
| NVMEs | 69-75 |
The power usage fluctuates between 85W and 98W. I think all of these values are acceptable.
| NVME power state hint | scrub speed GiB/s | Power [W] |
|---|---|---|
| 0 (default) | 11 | < 100 |
| 1 | 8 | < 93 |
| 2 | 4 | < 70 |
You can use `nvmecontrol` to tell the NVME disks to save energy. More information on this here and here. I was surprised that all of this works reliably on FreeBSD, but it does! The man-page is not great though. Simply call `nvmecontrol power -p X nvmeYns1` to set the hint to X on device Y, if desired. Note that this needs to be repeated after every reboot.
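Since the setting does not survive a reboot, a small loop, e.g. run from `/etc/rc.local`, can reapply it at boot (device names are examples; `nvmecontrol devlist` shows the actual ones):

```shell
# Set power state hint 1 on every NVME namespace (example device names)
for ns in nvme0ns1 nvme1ns1 nvme2ns1 nvme3ns1; do
    nvmecontrol power -p 1 "$ns"
done

# Show the currently active power state of one device
nvmecontrol power nvme0ns1
```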
| `dev.hwpstate_intel.*.epp` | scrub speed GiB/s | Power [W] |
|---|---|---|
| 50 (default) | 11.0 | < 100 |
| 100 | 3.3 | < 60 |

You can use the `dev.hwpstate_intel.*.epp` sysctls for your cores to tune the eagerness of each core to scale up, with higher numbers meaning less eagerness.
In the end, I decided not to apply any of these “under load” optimisations. It is just very difficult, because, as shown, all optimisations that reduce power draw also increase the time a task takes. I am not certain of any good way to quantify this, but it feels like keeping the system at 70W for 30min instead of 100W for 10min is not really worth it. And I kind of also want the system to be fast; that’s why I spent so much money on it 🙃
The CPU does have a cTDP mode that can be activated via the BIOS and which is “worth it”, according to some articles I have read. I might give this a try in the future.
Final remarks
What a ride! I spent a lot of time optimising and benchmarking this and I am quite happy with the outcome. I am able to exhaust the 10GBit LAN connection completely, and still have resources left on the server :)
Thanks to the people at www.bsdforen.de who had quite a few helpful suggestions.
If you see anything that I missed, or have suggestions on how to improve this setup, let me know in the comments!
Footnotes
1. With ASUS being the only exception. ↩︎
2. Proper in this context means well-supported by FreeBSD and with good performance. Usually, that means an Intel NIC. Unfortunately, all the modern boards come with Marvell/Aquantia AQtion adaptors, which are not well-supported by FreeBSD. ↩︎
3. The geli device was created with `geli init -b -s4096 -l256`. ↩︎
4. I wanted to perform all these tests with Linux as well, but I ran out of time 🙈 ↩︎
5. I did try a slightly more modern adapter with the Intel 82599EN chip. This is an SFP+ chip, but I found an adaptor with built-in 10GBase-T for around 150€. It ended up having some driver issues (you needed to plug and unplug the CAT cable for the device to go UP), and it used more energy than the X540, so I sent it back. ↩︎