A fast fileserver with FreeBSD, NVMEs, ZFS and NFS
I have a small server running in my flat that serves files locally via NFS and remotely via Nextcloud. This post documents the slightly overpowered upgrade of the hardware and subsequent performance / efficiency optimisations.
TL;DR
- I can fully saturate a 10Gbit LAN connection, achieving more than 1100 MiB/s throughput.
- I can perform a `zpool scrub` at 11 GiB/s, completing a 6.8 TiB scrub in 11 min.
- Idle power usage can be brought down to 34W.
Old setup and requirements
What the server does:
- Serve files via NFS to
- my workstation (high traffic)
- a couple of Laptops (low traffic)
- the TV running Kodi (medium traffic)
- Host a Nextcloud which provides file storage, PIM etc. for a handful of people
Not a lot of compute is necessary, and I have tried to keep power usage low. The old hardware served me well for a really long time:
- AMD 630 CPU
- 16GiB RAM
- 2+1 * 4TiB spinning disk RAIDZ1 with SSD ZIL (“write-cache”)
The main pain point was slow disk access resulting in poor performance when large files were read by the Nextcloud. Browsing through my photo collection via NFS was also very slow, because thumbnail generation needed to pull all the images. Furthermore, low speed meant that I was not doing as much on the remote storage as I would have liked (e.g. storing games), resulting in my workstation’s storage always running out. And I was just reaching the limits of my ZFS pool anyway, so it was time for an upgrade!
New setup
To get better I/O, I thought about switching from HDDs to SATA SSDs, but then realised that SATA SSD performance is very low compared to NVME performance, while the price difference is not that large. Also, NFS+ZFS generates quite a bit of I/O, typically requiring the use of faster caching devices, further complicating the setup. Consequently, I decided to go for a pure NVME setup. Of course, the new server would also need 10GBit networking, so that I can use all that speed in the LAN!
This is the new hardware! I will discuss the details below.
Mainboard, CPU and RAM
The main requirement for the mainboard is to offer connectivity for four NVME disks. And to be prepared for the future, I would actually like 1-2 extra NVME slots. There are two ways to attach NVMEs to a motherboard:
- directly (“natively”)
- via an extension card that is plugged into a PCIexpress slot
Initially, I had assumed no mainboard would offer sufficient native slots, so I did a lot of research on option 2. The summary: it is quite messy. If you want to use a single extension card that hosts multiple NVMEs (which is required in this case), you need so-called “bifurcation support” on the mainboard. This lets you e.g. put a PCIe x8 card with two x4 NVME disks into a PCIe x8 slot on the mainboard. However, this feature is really poorly documented,1 and it varies between mainboard AND CPU whether they support no bifurcation, only x8 → x4x4, or also x16 → x4x4x4x4. The different PCIe versions and speeds, and the difference between the actually supported speed and the electrical interface, add further complications.
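Once the hardware is in place, FreeBSD's `pciconf` can at least confirm what link each NVME actually negotiated; a quick sketch (the grep patterns are just examples):

```shell
# List all PCI devices with their driver names; NVME disks show up as nvmeN
pciconf -l | grep nvme

# Dump capability info; the "PCI-Express" capability line shows the
# negotiated link width (e.g. x4) and speed (e.g. 16.0GT/s) per device
pciconf -lc | grep -A6 nvme
```

If an x4 disk comes up with only an x2 link, the slot is sharing lanes with something else.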
In the end, I decided to not do any experiments and look for a board that natively supports a high number of NVME slots. For some reason, this feature is very rare on AMD mainboards, so I switched to Intel (although I am actually a bit of an AMD fanboy). I probably could have gone with a board that has 5 slots, but I use hardware for a long time and wanted to be safe, so I took a board that has 6 NVME slots (2 free slots):
None of the available boards had a proper2 10GBit network adaptor, so having a usable PCIe slot for a dedicated card was also a requirement. It is important to check whether PCIe slots can still be used when all NVME slots are occupied; sometimes they internally share the bandwidth. But for the above board this is not the case.
Important: To be able to boot FreeBSD on this board, you need to add the following to `/boot/device.hints`:

```
hint.uart.0.disabled="1"
hint.uart.1.disabled="1"
```
For the CPU, I just went with something on the low TDP end of the current Intel CPU range, the Intel Core i3-12100T. Four cores + four threads was exactly what I was looking for, and 35W TDP sounded good. I paired that with some off-the-shelf 32GiB RAM kit.
Case, power supply & cooling
Strictly speaking a 2U case would have been sufficient, but I thought a 3U case might offer better air circulation. I ended up with the Gembird 19CC-3U-01. For unknown reasons, I chose a 2U horizontal CPU fan, instead of a 3U one. The latter would definitely have provided better airflow, but since the fan barely runs at all, it doesn’t make much of a difference.
I was unsuccessful in finding a good PSU that is super efficient in the average case of around 40W power usage but also covers spikes well above 100W, so I just chose the cheapest 300W one I could get :)
The case with everything in place.
The built-in fans are very noisy. I chose to replace one of the intake fans with a spare one I had lying around and to connect only one of the rear exhaust fans. But I added an extra fan where the extension slots are, to divert some airflow around the NIC, which otherwise gets quite warm. This should also blow some air over the NVME heatsinks! All fans can be regulated and fine-tuned from the BIOS of the mainboard, which I totally recommend you do. At the current temperatures and average workloads, the whole setup is almost silent.
Storage
Now, the fun begins. Since I needed more space than before, I clearly wanted a 3+1 x 4TiB RAIDZ1.
My goal was to be able to saturate a 10GBit connection (so get around 1GiB/s throughput) and still have the server be able to serve the Nextcloud without slowing down significantly. Currently the WAN upload is quite slow, but I hope to have fibre in the future. In any case, I thought that any modern NVME should be fast enough, because they all advertise speeds of multiple GiB/s.
Choice of disks
Anyway, I got two Crucial P3 Plus 4TB (which were on sale at Amazon for ~190€), as well as two Lexar NM790 4TB (which were also a lot cheaper than they are now). My assumption that they were very comparable was very wrong:
| Disk | IOPS rand-read | IOPS read | IOPS write | MB/s read | MB/s write | “cat speed” MB/s |
|---|---|---|---|---|---|---|
| Crucial | 53,500 | 794,000 | 455,000 | 2,600 | 4,983 | ~700 |
| Lexar | 53,700 | 796,000 | 456,000 | 4,578 | 5,737 | ~2,700 |
I used this fellow’s fio-script to generate all columns except the last. The last column was generated by simply cat’ing a 10GiB file of random numbers to `/dev/null`, which roughly corresponds to the read portion of copying a 4k movie file.
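For reference, the “cat speed” test needs nothing but `dd` and `cat`; a minimal sketch (file size shrunk here for illustration, the original test used a 10GiB file):

```shell
# Write a test file of random data (64 MiB here; the real test used 10 GiB)
dd if=/dev/urandom of=/tmp/bench.bin bs=1M count=64

# Sequentially read it back, discarding the data; dividing the file size
# by the reported duration gives the "cat speed"
time cat /tmp/bench.bin > /dev/null
```

Note that on a repeat run the file may be served from the ARC/page cache, which is why the original test used a file large enough to make caching irrelevant.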
Since I had two disks each, I actually took the time to test all of them in different mainboard slots, but the results were very consistent: in real-life tasks, the Crucial disks underperformed significantly, while the Lexar disks were super fast.
I decided to return the Crucial disks and get two more by Lexar 😎
Disk encryption
I always store my data encrypted at rest. FreeBSD offers GELI block-level encryption (similar to LUKS on Linux). But OpenZFS has also offered dataset/filesystem-level encryption for a while now. I previously used GELI, but I wanted to switch to ZFS native encryption, because it provides some advantages:
- Flexibility: I can choose later which datasets to encrypt; I can encrypt different datasets with different keys.
- Zero-knowledge backups: I can send incremental backups off-site that are received and fully integrated into the target pool without that server ever getting the decryption keys.
- Forward-compatibility: I can upgrade to better encryption algorithms later.
- Linux-compatibility: I can import the existing pool in a Linux environment for debugging or benchmarking.
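As a sketch of how the first two points look in practice (pool and dataset names are made up):

```shell
# Create an encrypted dataset; the passphrase is prompted for interactively
# and a different key can be used per dataset
zfs create -o encryption=aes-128-gcm -o keyformat=passphrase tank/private

# Zero-knowledge incremental backup: --raw sends the blocks as stored
# on disk, so the receiving host only ever sees ciphertext
zfs send --raw -i tank/private@monday tank/private@tuesday | \
    ssh backuphost zfs recv backup/private
```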
However, I had also heard that ZFS native encryption was slower, so I decided to do some benchmarks:
| Setup | IOPS rand-read | IOPS read | IOPS write | MB/s read | MB/s write | “cat speed” MB/s |
|---|---|---|---|---|---|---|
| no encryption | 54,700 | 809,000 | 453,000 | 4,796 | 5,868 | 2,732 |
| geli-aes-256-xts | 40,000 | 793,000 | 446,000 | 3,332 | 3,334 | 952 |
| zfs-enc-aes-256-gcm | 26,100 | 513,000 | 285,000 | 3,871 | 4,648 | 2,638 |
| zfs-enc-aes-128-gcm | 29,300 | 532,000 | 353,000 | 3,971 | 4,794 | 2,631 |
Interestingly, GELI3 performs much better on the IOPS, but much worse on throughput, especially on our real-life test case. Maybe some smart person knows the reason for this, but I took this benchmark as an assurance that going with native encryption was the right choice.4 One reason for the good performance of the native encryption seems to be that it makes use of the CPU’s avx2 extensions.
At this point, I feel like I do need to warn people about some ZFS encryption related issues that I learned about later. Please read this. I have had no problems to date, but make up your own mind.
RAIDZ1
| recordsize | compr. | encrypt | IOPS rand-read | IOPS read | IOPS write | MB/s read | MB/s write | “cat speed” MB/s |
|---|---|---|---|---|---|---|---|---|
| 128 KiB | off | off | 50,000 | 869,000 | 418,000 | 3,964 | 5,745 | 2,019 |
| 128 KiB | on | off | 49,800 | 877,000 | 458,000 | 3,929 | 4,654 | 1,448 |
| 128 KiB | off | aes128 | 26,300 | 484,000 | 230,000 | 3,589 | 5,331 | 2,142 |
| 128 KiB | on | aes128 | 27,400 | 501,000 | 228,000 | 3,510 | 3,927 | 2,120 |
These are the numbers after creation of the RAIDZ1-based pool. They are quite similar to the numbers measured before. The impact of encryption on IOPS is clearly visible, less so on sequential read/write throughput. Compression seems to impact write throughput but not read throughput, which is expected for `zstd`. It is unclear why “cat speed” is lower here.
| recordsize | compr. | encrypt | IOPS rand-read | IOPS read | IOPS write | MB/s read | MB/s write | “cat speed” MB/s |
|---|---|---|---|---|---|---|---|---|
| 1 MiB | off | off | 7,235 | 730,000 | 404,000 | 3,686 | 3,548 | 2,142 |
| 1 MiB | on | off | 7,112 | 800,000 | 470,000 | 3,624 | 3,447 | 2,064 |
| 1 MiB | off | aes128 | 3,259 | 497,000 | 258,000 | 3,029 | 3,422 | 2,227 |
| 1 MiB | on | aes128 | 3,697 | 506,000 | 249,000 | 3,137 | 3,361 | 2,237 |
Many optimisation guides suggest setting the zfs `recordsize` to 1 MiB for most use-cases, especially storage of media files. But this seems to drastically penalise random-read IOPS while providing little to no benefit in the sequential read/write scenarios. This is actually a bit surprising and I will need to investigate this more. Is it perhaps because NVMEs are good at parallel access and therefore suffer less from fragmentation anyway?
In any case, the main takeaway is that overall read and write throughputs are over 3,000 MiB/s in the synthetic case and over 2,000 MiB/s in the manual case, which is great.
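Since `recordsize` is a per-dataset property, it can be tuned per workload rather than pool-wide; a sketch (dataset names are examples, and the setting only affects newly written blocks):

```shell
# Keep the 128K default for mixed/random workloads
zfs set recordsize=128K tank/home

# Per-dataset override, e.g. for a dataset holding large media files
zfs set recordsize=1M tank/media

# Verify what is in effect
zfs get recordsize tank/home tank/media
```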
Other disk performance metrics
| Operation | Speed [MiB/s] |
|---|---|
| Copying 382 GiB between two datasets (both enc+comp) | 1,564 |
| Copying 505 GiB between two datasets (both enc+comp) | 800 |
| `zpool scrub` of the full pool | 11,000 |
These numbers further illustrate some real world use-cases. It’s interesting to see the difference between the first two, but it’s also important to keep in mind that this is reading and writing at the same time. Maybe some internal caches are exhausted after a while? I didn’t debug these numbers further, but I think the speed is quite good after such a long read/write.
More interesting is the speed for scrubbing, and, yes, I have checked this a couple of times. A scrub of 6.84TiB happens in 10m - 11m, which is pretty amazing, I think, considering that it is reading the data and calculating checksums. I am assuming that sequential read is just very fast and that access to the different disks happens in parallel. The checksum implementation is apparently also avx2 optimised.
LAN
Network adapter
Based on recommendations, I decided to buy an Intel card. Cheaper 10GBit network cards are available from Marvell/Aquantia, but the driver support in FreeBSD is poor, and the performance is supposedly also not close to that of Intel.
Many people suggested I go for SFP+ (fibre) instead of 10GBase-T (copper), but I already have CAT7 cables in my flat. While I could have used fibre purely for connecting the server to the switch (and this would likely save some power), I would have had to buy a new switch and the options were just not economical—I already have a switch with two 10GBase-T ports which I had bought for exactly this setup.
The cheapest Intel 10GBase-T card out there is the X540 which is quite old and available on Amazon for around 80€. I bought two of those (one for the server and one for the workstation). More modern cards are supposedly more energy efficient, but also a lot more expensive.5
NFS Performance
On server and client, I set:

- `kern.ipc.maxsockbuf=4737024` in `/etc/sysctl.conf`
- `mtu 9000 media 10gbase-t` in `/etc/rc.conf` (ifconfig)

Only on the server:

- `nfs_server_maxio="1048576"` in `/etc/rc.conf`

Only on the client:

- `nfsv4,nconnect=8,readahead=8` as the mount options for the NFS mount
- `vfs.maxbcachebuf=1048576` in `/boot/loader.conf` (not sure any more if this makes a difference)

These settings allow larger buffers and increase the amount of readahead. This favours large sequential reads/writes over small random reads/writes.
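Putting the client-side pieces together, the mount can be made permanent via `/etc/fstab` (the server address and paths are examples):

```shell
# /etc/fstab entry on the client:
# 10.0.0.2:/  /mnt/server  nfs  rw,nfsv4,nconnect=8,readahead=8  0  0

# equivalent one-off mount for testing:
mount -t nfs -o nfsv4,nconnect=8,readahead=8 10.0.0.2:/ /mnt/server
```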
The full options on the client end up being:
```
# nfsstat -m
X.Y.Z.W:/ on /mnt/server
nfsv4,minorversion=2,tcp,resvport,nconnect=8,hard,cto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=1048576,wsize=1048576,readdirsize=1048576,readahead=8,wcommitsize=16777216,timeout=120,retrans=2147483647
```
I use NFS4 for my workstation and NFS3 for everyone else. I have performed no benchmarks on NFS3, but I see no reason why it would be slower.
| IOPS rand-read | IOPS read | IOPS write | MB/s read | MB/s write | “cat speed” MB/s |
|---|---|---|---|---|---|
| 283 | 292,000 | 33,200 | 1,156 | 594 | 1,164 |
This benchmark was performed on a dataset with 1M recordsize, encryption, but no compression.
Random-read IOPS are pretty bad, and I see a strong correlation here to the `rsize` (if I halve it, I double the IOPS; not shown in table). It’s possible that every 4KiB read actually triggers a 1MiB read in NFS, which would explain this. On the other hand, the sequential read and write performance is pretty good, with synthetic and real-world read speeds being very close to the theoretical maximum of the 10GBit connection.
One thing to keep in mind: the blocksize when reading has a very strong impact on the performance. This can be seen when using `dd` with different `bs` arguments. Of course, 1MiB is optimal if that is also used by NFS, and `cat` seems to do this. However, `cp` does not, which results in much slower performance than using `dd if=.. of=.. bs=1M`.
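The effect is easy to demonstrate; a sketch (in a real test the file would live on the NFS mount, a local file is used here so the commands run anywhere):

```shell
# Example file; on a real test this would be a large file on the NFS mount
FILE=/tmp/blocksize-demo.bin
dd if=/dev/urandom of="$FILE" bs=1M count=32

# 4 KiB userspace reads: over NFS, each can trigger a full rsize-sized READ
dd if="$FILE" of=/dev/null bs=4k

# 1 MiB reads line up with rsize=1048576 and perform much better
dd if="$FILE" of=/dev/null bs=1M
```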
I have done measurements with plain `nc` over the wire (also reaching 1,160 MiB/s) and `iperf3`, which achieves 1,233 MiB/s, just below the 1,250 MiB/s equivalent of 10GBit.
Power consumption and thermals
For a computer running 24/7 in my flat, power consumption is of course important. I bought a device to measure power consumption at the outlet to get an accurate picture.
idle
Because the computer is idle most of the time, optimising idle power usage is most important.
| Change | Power [W] |
|---|---|
| default | 50 |
| `*_cx_lowest="Cmax"` | 45 |
| disable WiFi and BT | 42 |
| `media 10gbase-t` | 45 |
| `machdep.hwpstate_pkg_ctrl=0` | 41 |
| turn on chassis fans | 42 |
| ASPM modes to L0s+L1 / enabled | 34 |
I assume that the same setup on Linux would be slightly more efficient, but 34W in idle is acceptable.
Clearly, the most impactful changes were:

- Activating ASPM for the PCIe devices in the BIOS.
- Adding `performance_cx_lowest="Cmax"` and `economy_cx_lowest="Cmax"` to `/etc/rc.conf`.
- Adding `machdep.hwpstate_pkg_ctrl=0` to `/boot/loader.conf`.
You can find online resources on what these options do. You might need to update the BIOS to be able to disable WiFi and Bluetooth devices completely. You can also use hints in `/boot/device.hints`, but this doesn’t save as much power.
Using 10GBase-T speed on the network device (instead of 1000Base-T) unfortunately increases power usage notably, but there is nothing I could find to mitigate this.
Things that are often recommended but that did not help me (at least not in idle):
- NVME power states (more on this below)
- lower values for `sysctl dev.hwpstate_intel.*.epp` (more on this below)
- `hw.pci.do_power_nodriver=3`
| idle temperatures | °C |
|---|---|
| CPU | 37-40 |
| NVMEs | 52-55 |
The latter was particularly interesting, because I had heard that newer NVMEs, and especially those by Lexar get very warm. It should be noted though, that the mainboard comes with a large heatsink that covers all NVMEs.
under load
The only “load test” that I performed was a scrub of the pool. Since this puts stress on the NVMEs and also the CPUs, it should be at least indicative of how things are going.
| during `zpool scrub` | °C |
|---|---|
| CPU | 55-59 |
| NVMEs | 69-75 |
The power usage fluctuates between 85W and 98W. I think all of these values are acceptable.
| NVME power state hint | scrub speed GiB/s | Power [W] |
|---|---|---|
| 0 (default) | 11 | < 100 |
| 1 | 8 | < 93 |
| 2 | 4 | < 70 |
You can use `nvmecontrol` to tell the NVME disks to save energy. More information on this here and here. I was surprised that all of this works reliably on FreeBSD, but it does! The man-page is not great though. Simply call `nvmecontrol power -p X nvmeYns1` to set the hint to X on device Y, if desired. Note that this needs to be repeated after every reboot.
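Since the setting does not survive a reboot, a small loop, e.g. run from `/etc/rc.local`, can reapply it at boot (device names are examples; `nvmecontrol devlist` shows the actual ones):

```shell
# Set power state hint 1 on every NVME namespace (example device names)
for ns in nvme0ns1 nvme1ns1 nvme2ns1 nvme3ns1; do
    nvmecontrol power -p 1 "$ns"
done

# Show the currently active power state of one device
nvmecontrol power nvme0ns1
```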
| `dev.hwpstate_intel.*.epp` | scrub speed GiB/s | Power [W] |
|---|---|---|
| 50 (default) | 11.0 | < 100 |
| 100 | 3.3 | < 60 |

You can use the `dev.hwpstate_intel.*.epp` sysctls for your cores to tune the eagerness of each core to scale up, with higher numbers meaning less eagerness.
In the end, I decided not to apply any of these “under load” optimisations. It is just very difficult, because, as shown, all optimisations that reduce power draw also increase the time a task takes. I am not certain of any good way to quantify this, but it feels like keeping the system at 70W for 30min instead of 100W for 10min is not really worth it. And I kind of also want the system to be fast; that’s why I spent so much money on it 🙃
The CPU does have a cTDP mode that can be activated via the BIOS and which is “worth it”, according to some articles I have read. I might give this a try in the future.
Final remarks
What a ride! I spent a lot of time optimising and benchmarking this and I am quite happy with the outcome. I am able to exhaust the 10GBit LAN connection completely, and still have resources left on the server :)
Thanks to the people at www.bsdforen.de who had quite a few helpful suggestions.
If you see anything that I missed, or have suggestions on how to improve this setup, let me know in the comments!
Footnotes
1. With ASUS being the only exception. ↩︎
2. Proper in this context means well-supported by FreeBSD and with good performance. Usually, that means an Intel NIC. Unfortunately, all the modern boards come with Marvell/Aquantia AQtion adaptors, which are not well-supported by FreeBSD. ↩︎
3. The geli device was created with `geli init -b -s4096 -l256`. ↩︎
4. I wanted to perform all these tests with Linux as well, but I ran out of time 🙈 ↩︎
5. I did try a slightly more modern adapter with the Intel 82599EN chip. This is an SFP+ chip, but I found an adaptor with built-in 10GBase-T for around 150€. It ended up having some driver issues (you needed to plug and unplug the CAT cable for the device to go UP), and it used more energy than the X540, so I sent it back. ↩︎