A couple of weeks ago I came across a question asking why NVMe storage on one server was slower than SATA storage on another. I looked at the server specifications and realized it was a trick question: the NVMe drive was a consumer product, while the SATA SSD was a server-grade one.
Obviously, comparing products from different segments in different environments is not a fair comparison, but that alone is not an exhaustive technical answer. Let's study the basics, run some experiments, and answer the question properly.
What is fsync and where is it used?
To speed up work with drives, data is buffered: it is kept in volatile memory until a convenient moment arrives to flush the buffer's contents to the drive. What counts as a "convenient moment" is determined by the operating system and the characteristics of the drive. If power is lost, all data in the buffer is lost.
A number of tasks require certainty that a change to a file has actually been written to the drive rather than sitting in an intermediate buffer. This certainty can be obtained with the POSIX-compliant fsync system call, which initiates a forced write from the buffer to the drive.
What is affected by frequent use of fsync?
With normal I/O, the operating system tries to optimize communication with drives, since external storage is the slowest level of the memory hierarchy. It therefore tries to write as much data as possible per access to the drive.
Let us demonstrate the impact of fsync with a concrete example. Our test subjects are the following SSDs:
- Intel® DC SSD S4500 480 GB, connected via SATA 3.2, 6 Gbps;
- Samsung 970 EVO Plus 500GB, connected via PCIe 3.0 x4, ~31 Gbps.
Tests are carried out on an Intel® Xeon® W-2255 under Ubuntu 20.04, using sysbench 1.0.18 for disk testing. A single partition formatted as ext4 is created on each disk. Preparation consists of creating 100 GB of test files:
sysbench --test=fileio --file-total-size=100G prepare
Launching the tests:
# Without fsync
sysbench --num-threads=16 --test=fileio --file-test-mode=rndrw --file-fsync-freq=0 run

# With fsync after each write
sysbench --num-threads=16 --test=fileio --file-test-mode=rndrw --file-fsync-freq=1 run
The test results are presented in the table:
|                              | Intel S4500 | Samsung 970 EVO+ |
|------------------------------|-------------|------------------|
| Reading without fsync, MiB/s | 5734.89     | 9028.86          |
| Writing without fsync, MiB/s | 3823.26     | 6019.24          |
| Reading with fsync, MiB/s    | 37.76       | 3.27             |
| Writing with fsync, MiB/s    | 25.17       | 2.18             |
It is easy to see that the client-segment NVMe drive leads confidently when the operating system decides how to work with the disks, and loses badly when fsync is used. This raises two questions:
1. Why does the read speed in the test without fsync exceed the physical bandwidth of the channel?
2. Why does an SSD from the server segment handle a large number of fsync requests so much better?
The answer to the first question is simple: sysbench generates files filled with zeros, so the test ran over 100 gigabytes of zeros. Since the data is extremely monotonous and predictable, various OS optimizations come into play and significantly speed up execution. For scale, a SATA 3.2 link at 6 Gbps can physically carry at most about 600 MB/s, so the reported 5734 MiB/s clearly comes from the OS cache rather than the drive.
If these results make you doubt sysbench altogether, you can cross-check with fio.
# Without fsync
fio --name=test1 --blocksize=16k --rw=randrw --iodepth=16 --runtime=60 --rwmixread=60 --fsync=0 --filename=/dev/sdb

# With fsync after each write
fio --name=test1 --blocksize=16k --rw=randrw --iodepth=16 --runtime=60 --rwmixread=60 --fsync=1 --filename=/dev/sdb
|                              | Intel S4500 | Samsung 970 EVO+ |
|------------------------------|-------------|------------------|
| Reading without fsync, MiB/s | 45.5        | 178              |
| Writing without fsync, MiB/s | 30.4        | 119              |
| Reading with fsync, MiB/s    | 32.6        | 20.9             |
| Writing with fsync, MiB/s    | 21.7        | 13.9             |
The trend of NVMe performance collapsing under fsync is clearly visible, so we can move on to the second question.
Optimization or bluffing
Earlier we said that data is stored in a buffer, but we did not specify which one, because it was not essential. Even now we will not dive into operating-system internals, and will instead distinguish two general types of buffers:
- program buffers;
- hardware buffers.
By program buffers we mean the buffers in the operating system; by hardware buffer we mean the volatile memory of the disk controller. The fsync system call sends the drive a command to write the data from its buffer to the main storage, but it has no way to verify that the command is actually carried out.
Since the server SSD handles fsync better, two assumptions can be made:
- the drive is designed for this kind of load;
- the drive "bluffs" and ignores the command.
Dishonest behavior by the drive can be detected with a power-loss test, for example using the script diskchecker.pl, which was created back in 2005.
This script requires two physical machines, a "server" and a "client". The client writes small blocks of data to the disk under test, calls fsync, and tells the server what was written.
# Run on the server
./diskchecker.pl -l [port]

# Run on the client
./diskchecker.pl -s <server[:port]> create <file> <size_in_MB>
After running the script, cut power to the "client" and leave it off for several minutes. It is important to actually cut power to the machine under test, not just perform a hard shutdown. After some time, the client can be plugged back in and booted into the OS. Once it has booted, run diskchecker.pl again, this time with the verify argument.
./diskchecker.pl -s <server[:port]> verify <file>
At the end of the check you will see the number of errors. If it is 0, the drive has passed the test. You can repeat the experiment several times to rule out a lucky pass.
Our S4500 showed no errors on power loss, so we can say it is ready for workloads with a large number of fsync calls.
When selecting disks or whole ready-made configurations, keep the specific workload in mind. At first glance it seems obvious that NVMe, i.e. an SSD with a PCIe interface, is faster than a "classic" SATA SSD. However, as we have seen today, under specific conditions and with certain workloads this may not be the case.