In this blog post we detail how we helped our partner Reperio optimize high-performance storage solutions for database workloads that rely heavily on small block synchronous writes. Working together, we tested and compared different RAID configurations to address the critical I/O performance challenges that databases like CockroachDB, PostgreSQL, and CouchDB face with their 4k synchronous write operations.
About Reperio
Reperio helps businesses improve technology and collaboration through foundational solutions. They offer flexible, no-commitment services for clients in the US ranging from home offices to large organizations. Their expertise includes cloud virtualization, high-availability system design, databases, horizontal scaling, open source infrastructure (preventing vendor lock-in), and delivering enterprise-grade performance within budget constraints.
Background and Challenges
Many essential software platforms (including relational databases like PostgreSQL, MySQL, and Microsoft SQL Server, as well as distributed databases like CockroachDB and CouchDB) rely heavily on small block synchronous writes to ensure data durability and consistency. These systems depend on the fsync() system call to persist critical data to disk, making sure that each transaction is safely recorded. The size of these writes is often as small as 4k, and production data from CockroachDB environments confirms that such small, random synchronous writes make up a significant portion of total I/O. This I/O pattern represents the most demanding and costly test for any storage system, both in terms of raw performance and hardware resource consumption. Without properly optimized storage, these expensive IOPs can throttle overall application throughput and reliability. Testing and tuning for random 4k synchronous writes is not optional; it's essential for maximizing database performance. For more details, see CockroachDB's guidance on disk stalls, PostgreSQL's WAL configuration, and how CouchDB uses fsync to prevent data corruption.
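To get a rough feel for how punishing this pattern is, you can approximate it from the shell with dd. This is only a minimal sketch: the target path is hypothetical, the access pattern is sequential rather than random, and it assumes your filesystem supports O_DIRECT.

```bash
# Rough approximation of small synchronous writes: each 4k block is written
# with O_DIRECT (bypassing the page cache) and O_DSYNC (synchronous) semantics.
# /mnt/data/synctest is a placeholder path; point it at your own scratch file.
dd if=/dev/zero of=/mnt/data/synctest bs=4k count=10000 oflag=direct,dsync
```

Throughput from a run like this typically lands far below the drive's datasheet maximums, which is exactly why this I/O pattern deserves dedicated testing with fio.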
We are using CockroachDB as an example database here, but the results apply to most databases with a similar write pattern. Reperio has measured its CockroachDB deployments and defined the test loads used here.
CockroachDB's write block size distribution looks like the following, ordered from most to least observed (a way to sample this on your own nodes is sketched after the charts):
- 4k: 55%
- 8k: 23%
- 16k: 16.8%
- 32k+: 5.2%

iostat block size pie chart

iostat block size histogram
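If you want to check the block size profile of your own workload, the sketch below shows one way to do it. It is assumption-laden: it presumes the sysstat and bcc-tools packages are installed, the device name is illustrative, and tool paths and column names vary between distributions and versions.

```bash
# Average request size for a device (the areq-sz / avgrq-sz column),
# refreshed every second. nvme0n1 is a placeholder device name.
iostat -x nvme0n1 1

# Per-process histogram of block I/O sizes (bcc-tools; path varies by distro).
/usr/share/bcc/tools/bitesize
```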
Double replication
Storage systems often replicate data to multiple nodes (Ceph or VMware vSAN, for example), so every write operation is effectively multiplied at the network layer, compounding I/O load and drastically inflating both performance overhead and infrastructure costs.
However, many popular modern data platforms do not actually need this. CockroachDB, CouchDB, and Elasticsearch embed replication and high availability directly into the application layer. These self-replicating, self-balancing platforms are engineered to handle fault tolerance and data distribution on their own. A simple local RAID array is therefore sufficient for node-level redundancy, and the cluster-level redundancy of a complicated storage system is replaced by more efficient application-level features.
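CockroachDB is a good illustration: it keeps three replicas of every range by default and lets you adjust that per cluster, database, or table with zone configurations. A minimal sketch, assuming a reachable node running in insecure mode (flags and values are illustrative):

```bash
# Show the current replication settings for the default range.
cockroach sql --insecure --execute="SHOW ZONE CONFIGURATION FOR RANGE default;"

# Explicitly set the replication factor to 3 (the default) at the application
# layer, instead of duplicating writes again inside the storage system.
cockroach sql --insecure --execute="ALTER RANGE default CONFIGURE ZONE USING num_replicas = 3;"
```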
Reperio helps clients avoid this unnecessary duplication by architecting solutions that respect the intelligence built into today's software stacks—ensuring optimal IOPS efficiency, lower latency, and significant savings on both hardware and operational expenses.
Final load definition
Based on all of the above, we tested the load in the worst case scenario: 4k synchronous I/O. This means that the numbers listed in this study are the guaranteed minimums for the setup used, and in real-world scenarios they may actually be even better.
At the same time, we are making smart architecture choices to avoid unnecessary overhead without sacrificing production reliability.
Testing environment
Platform: Supermicro AS-2125HS-TNR
CPU: 2 * AMD EPYC 9454 48-Core Processors (192 threads total)
RAM: 24 * SAMSUNG M321R4GA3BB6-CQKET (all 24 memory channels populated)
Drives: 8 * SanDisk DC SN861 SDS6BA138PSP9X3
OS: Rocky Linux 9.4
RAID engines: xiRAID Classic, mdadm
Some findings:
- Single thread matters: Reperio first tested the same load on a system with 2.1GHz CPUs and got roughly half of our initial random write performance even before any system tuning. The results listed here were obtained on 3.8GHz CPUs, almost twice the clock speed. After manually limiting our CPU frequency (using the cpupower utility, as sketched after this list), performance dropped to the same level as on the first system. This shows that synchronous random writes are often single-thread bound, benefiting from high clock speed and IPC more than from a multitude of cores. Intel vs AMD IPC differences don't seem to matter much here (Reperio used an Intel Xeon Gold 6530, we used an AMD EPYC 9454, and clock-for-clock they performed very similarly), so it mostly boils down to clock speed. Nonetheless, IPC varies from generation to generation, so this point may not hold forever. Also beware of the conditions your CPUs require to sustain their turbo boost mode. Some CPUs advertise very high boost clocks but only deliver them for a limited time under very specific conditions, e.g. temperature below 80 degrees Celsius, or only for a single-threaded load. The number that matters is the all-core boost clock.
- Memory throughput matters: our initial tests were done with only 8 of the 24 memory channels populated (4 of 12 per CPU), which created a significant random I/O bottleneck (not a 3x performance loss, but still considerably worse results). Populating all supported channels greatly increased the IOPS numbers and produced the results below.
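For reference, this is the kind of cpupower invocation used to cap the clock and reproduce the slower system; the frequencies are illustrative and depend on your CPU and governor.

```bash
# Inspect the current governor and frequency limits.
cpupower frequency-info

# Cap all cores at ~2.1 GHz to mimic the slower system, then restore the limit.
cpupower frequency-set --max 2100MHz
cpupower frequency-set --max 3800MHz
```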
xiRAID configuration
RAID10 has more storage overhead than parity RAID levels (like 5 or 6) because it sacrifices half of the raw capacity for redundancy, but that buys superb reliability: up to half of the drives in the array can fail without data loss or service interruption, as long as no mirror group loses both of its drives. In exchange, RAID10 carries "just" a 2x penalty for random writes, as there are no checksums to calculate (50% of the theoretical maximum performance of the drives). With the 8 drives below delivering roughly 3M IOPS of raw 4k random writes, that puts the theoretical RAID10 ceiling at around 1.5M IOPS. In RAID6, each small random write typically involves 3 write operations (1 for data and 2 for parity) and 3 read operations (to fetch the old data and parity), so only about 33% of the raw drive write performance is achieved for random writes due to read-modify-write operations and the overhead of the extra parity calculations.
Here is an array we tested:
- 8 drives
- 2 drives per group (effectively in RAID1)
- 4 groups (effectively RAID0 on 4 * RAID1)
- Chunk Size: 16K
Command:
xicli raid create -n media10 -l 10 -ss 16 -d /dev/nvme11n1 /dev/nvme14n1 /dev/nvme15n1 /dev/nvme16n1 /dev/nvme1n1 /dev/nvme3n1 /dev/nvme5n1 /dev/nvme7n1

MDADM configuration
- 8 drives
- layout=n2, we are mirroring our xiRAID setup with 4 groups of 2 drives each
- Chunk Size: 16K
Command:
mdadm --create --verbose --chunk=16 /dev/md0 --level=10 --layout=n2 --raid-devices=8 /dev/nvme11n1 /dev/nvme14n1 /dev/nvme15n1 /dev/nvme16n1 /dev/nvme1n1 /dev/nvme3n1 /dev/nvme5n1 /dev/nvme7n1
md0 : active raid10 nvme7n1[7] nvme5n1[6] nvme3n1[5] nvme1n1[4] nvme16n1[3] nvme15n1[2] nvme14n1[1] nvme11n1[0]
      11845584000 blocks super 1.2 16K chunks 2 near-copies [8/8] [UUUUUUUU]
      bitmap: 0/89 pages [0KB], 65536KB chunk
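To double-check the resulting layout (near-copies, chunk size, bitmap state), the standard mdadm views can be used:

```bash
# Resync progress, layout and bitmap state as reported by the kernel.
cat /proc/mdstat

# Per-device view of the RAID10 array created above.
mdadm --detail /dev/md0
```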
Benchmark Results
We ran the tests step by step, moving from using a single thread to all available 192 threads.
We place all runs for random loads on the RAID arrays in a separate table, along with graphs based on those values. This is done to visually showcase how the solution scales with threads and system resources.
To emphasize it again: we are using the psync I/O engine, meaning all operations run synchronously with an I/O depth of 1. FIO does not submit the next read or write in a thread until the previous operation has completed on the drive. This matches how real-world applications utilize the storage subsystem.
RAW drives
Before beginning RAID level testing, we measured the performance of the raw drives to ensure there were no bottlenecks and to establish a performance baseline. This also helps readers understand the raw IOPS per dollar of the system, compared to the realized IOPS per dollar with RAID.
Load | Jobs (per drive) | Result | Avg Latency |
---|---|---|---|
Random Write | 96 (12) | 3M IOPS | 20.27 μs |
Random Read | 192 (24) | 2.7M IOPS | 68.08 μs |
Random Mix 70% Write | 192 (24) | Write: 2.6M IOPS, Read: 1.1M IOPS | Write: 23.36 μs, Read: 117.13 μs |
Random Mix 70% Read | 192 (24) | Write: 773k IOPS, Read: 1.8M IOPS | Write: 36.15 μs, Read: 90.43 μs |
xiRAID psync random read/write performance measurements
The xiRAID testing was conducted according to the methodology outlined in the appendix, with drives preconditioning prior to testing.
In addition to the results shown in the table below, we also collected performance data for mixed random write and read workloads in two proportions: 70% random writes / 30% random reads and 30% random writes / 70% random reads. The results of these mixed workloads are presented in the appendix.
Jobs | Random read (IOPS) | Avg Read Latency (µs) | Random write (IOPS) | Avg Write Latency (µs) |
---|---|---|---|---|
1 | 14.0k | 71.02 | 58.4k | 17.04 |
4 | 56.8k | 70.93 | 217k | 17.99 |
16 | 216k | 73.2 | 834k | 18.07 |
32 | 420k | 75.38 | 1376k | 20.09 |
64 | 820k | 77.32 | 1410k | 29.39 |
96 | 1221k | 77.85 | 1450k | 36.19 |
128 | 1602k | 79.02 | 1398k | 49.73 |
160 | 1933k | 80.82 | 1405k | 74.94 |
192 | 2355k | 80.04 | 1436k | 124.35 |
Here is the combined chart visualizing both IOPS and latency for random read and write operations:

Key observations
- As shown in the graphs, xiRAID demonstrates excellent random write performance in psync, while maintaining low latency even under heavy load.
- Random read performance remains stable across all load levels, with consistently low latency as well.
MDADM/MDRAID psync random read/write performance measurements
mdadm/mdraid is the most commonly used software RAID solution in Linux environments. It is particularly popular for building storage systems for databases.
For this reason, we decided to perform comparative testing of mdadm/mdraid using the default configuration with the --bitmap option enabled. While this option significantly reduces random write performance, it more accurately reflects real-world, production-like configurations.
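We left the default internal write-intent bitmap in place for all of the mdraid numbers below. If you want to measure its cost on your own hardware, the bitmap can be dropped and re-added on a live array with standard mdadm options:

```bash
# Remove the internal write-intent bitmap (faster writes, longer resync after a crash).
mdadm --grow --bitmap=none /dev/md0

# Re-add it afterwards; keeping it is recommended for production arrays.
mdadm --grow --bitmap=internal /dev/md0
```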
Jobs | Random read (IOPS) | Avg Read Latency (µs) | Random write (IOPS) | Avg Write Latency (µs) |
---|---|---|---|---|
1 | 14.7k | 67.56 | 21.49k | 21.49 |
4 | 58.0k | 68.38 | 148k | 26.53 |
16 | 231k | 68.55 | 250k | 63.37 |
32 | 459k | 68.9 | 285k | 111.3 |
64 | 904k | 69.83 | 278k | 226.26 |
96 | 1342k | 70.8 | 267k | 357.57 |
128 | 1737k | 72.4 | 232k | 528.3 |
160 | 2113k | 74.32 | 246k | 619.43 |
192 | 2464k | 76.22 | 247k | 755.63 |
Here is the combined chart visualizing both IOPS and latency for random read and write operations:

Key observations
- Random read scales virtually linearly with increasing job count with relatively small growth in latency.
- Random write performance is limited to about 250kIOPS and increasing job count leads to greater latency.
Optimal performance figures
Here we list the results that make the most sense based on the resources/performance ratio (the optimal ones), together with results that are directly comparable between the tested RAID engines as reference points.
We consider a result optimal when it delivers the best scaling for the number of jobs used. For example, if 4 jobs deliver X GB/s while 8 jobs (2 times more CPU threads) deliver X+0.1 GB/s, the 4-job run is picked as optimal and gets listed, while the 8-job run is omitted from the table.
Load | Jobs | mdraid Result | mdraid Avg Latency | xiRAID Result | xiRAID Avg Latency |
---|---|---|---|---|---|
Random Write | 32 | 285k IOPS | 111.3 μs | 1.3M IOPS | 20.09 μs |
Random Write | 64 | 278k IOPS | 226.26 μs | 1.4 M IOPS | 29.39 μs |
Random Read | 192 | 2.5M IOPS | 76.22 μs | 2.4M IOPS | 94.06 μs |
Random Mix 70% Write | 32 | Write: 247k IOPS, Read: 106k IOPS | Write: 96.57 μs, Read: 73.53 μs | Write: 494k IOPS, Read: 212k IOPS | Write: 28.77 μs, Read: 81.84 μs |
Random Mix 70% Write | 128 | Write: 248k IOPS, Read: 106k IOPS | Write: 480.69 μs, Read: 74.66 μs | Write: 1408k IOPS, Read: 603k IOPS | Write: 37.76 μs, Read: 112.2 μs |
Random Mix 70% Read | 96 | Write: 227k IOPS, Read: 529k IOPS | Write: 246.25 μs, Read: 74.3 μs | Write: 412k IOPS, Read: 962k IOPS | Write: 32.85 μs, Read: 84.46 μs |
Random Mix 70% Read | 128 | Write: 226k IOPS, Read: 527k IOPS | Write: 388.43 μs, Read: 74.42 μs | Write: 1234k IOPS, Read: 529k IOPS | Write: 34.39 μs, Read: 87.61 μs |
Overall Comparison (mdraid vs. xiRAID)




- The performance data highlights xiRAID's superior efficiency. In RAID10, xiRAID delivers excellent random write performance (1.4M IOPS vs 278k IOPS on mdadm), making it highly suitable for latency-sensitive workloads such as database transactions. Keep in mind that all of these numbers are for synchronous operations, which typically land far below what is theoretically possible with deep queues. Real-world applications are usually much more sensitive to random writes, and that is exactly where xiRAID excels.
- Random Write Performance: xiRAID significantly outperforms mdraid in random write operations, both in terms of throughput (KIOPS) and latency. xiRAID's ability to handle high concurrent write loads with relatively low and stable latency is a major advantage.
- Random Read Performance: Both mdraid and xiRAID show strong random read performance with good KIOPS scaling and relatively low latency. While mdraid appears to scale slightly higher in KIOPS for random reads at the tested job counts, both are highly efficient for this workload. However, xiRAID maintains consistently low latency for reads even at higher job counts.
Conclusion
This collaborative study brings together the expertise of Xinnor and Reperio to analyze modern database workloads. Our research addresses a notable gap in published literature on this widely used scenario, offering valuable insights into optimizing database infrastructure.
Performance testing demonstrated outstanding xiRAID results across various configurations for synchronous read and write operations, under both sequential and random I/O patterns.
RAID10 implementations, in particular, delivered excellent random write performance—reaching 1.4 million IOPS—while maintaining consistently low latency. These characteristics make RAID10 well-suited for latency-sensitive database transactions.
The architectural analysis revealed that eliminating redundant replication can significantly boost performance for workloads that already implement cluster-wide data redundancy. At the same time, RAID continues to provide essential local storage redundancy, preventing service interruptions in the event of a device failure.
Hardware tuning also highlighted that single-threaded CPU performance—often underestimated—is critical for database workloads. Additionally, RAM bandwidth is becoming an increasingly important factor, especially when using PCIe Gen5 NVMe SSDs.
Appendix 1
Mixed workload for mdraid rwmixread=70
Jobs | Type | IOPS | Avg Latency (µs) |
---|---|---|---|
1 | r | 11.5k | 68.01 |
1 | w | 5k | 43.03 |
4 | r | 44.8k | 69.92 |
4 | w | 19.3k | 43.04 |
16 | r | 182k | 70.62 |
16 | w | 78.0k | 38.26 |
32 | r | 342k | 71.84 |
32 | w | 147k | 47.64 |
64 | r | 489k | 73.6 |
64 | w | 210k | 130.05 |
96 | r | 529k | 74.3 |
96 | w | 227k | 246.25 |
128 | r | 527k | 74.42 |
128 | w | 226k | 388.43 |
Mixed workload for mdraid rwmixwrite=70
Jobs | Type | IOPS | Avg Latency (µs) |
---|---|---|---|
1 | r | 6k | 32.57 |
1 | w | 16k | 68.01 |
4 | r | 27.4k | 70.67 |
4 | w | 64.1k | 31.32 |
16 | r | 86.8k | 72.63 |
16 | w | 203k | 46.84 |
32 | r | 106k | 73.53 |
32 | w | 247k | 96.57 |
64 | r | 119k | 74.11 |
64 | w | 278k | 197.08 |
96 | r | 106k | 74.72 |
96 | w | 248k | 352.89 |
128 | r | 106k | 74.66 |
128 | w | 248k | 480.69 |
Mixed workload for xiRAID rwmixread=70
Jobs | Type | IOPS | Avg Latency (µs) |
---|---|---|---|
1 | r | 12.5k | 70.96 |
1 | w | 5k | 19.09 |
4 | r | 50.2k | 71 |
4 | w | 21.6k | 18.37 |
16 | r | 188k | 74.19 |
16 | w | 80.6k | 23.21 |
32 | r | 344k | 78.25 |
32 | w | 148k | 31.49 |
64 | r | 664k | 81.29 |
64 | w | 284k | 32.67 |
96 | r | 962k | 84.46 |
96 | w | 412k | 32.85 |
128 | r | 1234k | 87.61 |
128 | w | 529k | 34.39 |
Mixed workload for xiRAID rwmixwrite=70
Jobs | Type | IOPS | Avg Latency (µs) |
---|---|---|---|
1 | r | 8k | 70.35 |
1 | w | 20.6k | 17.64 |
4 | r | 34.4k | 71.85 |
4 | w | 80.3k | 18.37 |
16 | r | 116k | 77.33 |
16 | w | 270k | 25.33 |
32 | r | 212k | 81.84 |
32 | w | 494k | 28.77 |
64 | r | 390k | 88.77 |
64 | w | 909k | 31.33 |
96 | r | 550k | 95.28 |
96 | w | 1282k | 32.88 |
128 | r | 603k | 112.2 |
128 | w | 1408k | 37.76 |
Benchmark Methodology
Testing Tools and Parameters
We ran our tests using the FIO utility. The full job definitions are available in the next section; here we will discuss the most important parameters (an illustrative invocation is sketched after this list):
- numjobs is the number of load-generating threads spawned by FIO.
- iodepth is the number of blocks in flight at the same time, used in preconditioning.
- ioengine=psync is the engine used to apply load. psync performs all operations in a synchronous manner, practically with iodepth=1, so setting any other value won’t have any effect.
- direct=1 bypasses the Linux kernel page cache/buffers; all operations go straight to the drive.
- blocksize=4K means we are testing the worst case scenario here, as this is the bulk of the CockroachDB load described above.
- numa_cpu_nodes=0 and numa_mem_policy=local restrict CPU cores and RAM respectively to the NUMA nodes the drives are physically connected to. Same idea as xiRAID CPU allow parameter.
- runtime=120 we are running pretty short tests here, but thanks to preconditioning the whole block device for hours beforehand, we are getting the actual sustained results. Running the tests themselves much longer will not add significant value.
- exitall=1 terminates all jobs as soon as the first job finishes. Without this, the results may become less representative as the block device load drops towards the end, with fewer and fewer jobs still running. Preconditioning may also become drastically less efficient if the load at the end is unable to saturate the drive controllers' garbage collector.
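Put together, the command-line equivalent of one of our random write test runs would look roughly like the sketch below. It is illustrative rather than the exact job file we ran; /dev/xi_test is the array device name used in the job patterns later in this appendix.

```bash
fio --name=rand-write-4k --filename=/dev/xi_test \
    --ioengine=psync --rw=randwrite --bs=4k --direct=1 \
    --numa_cpu_nodes=0 --numa_mem_policy=local \
    --numjobs=32 --runtime=120 --exitall --group_reporting
```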
Let's go a bit deeper into that last point about the drive garbage collector (GC). Historically, preconditioning was done by overwriting the target block device (drive/RAID) twice with the target block size, using 1 job, and that was sufficient.
But the controllers inside modern PCIe Gen5 drives, including our SanDisk DC SN861, have gained a lot of sophistication. Not only are 2 passes not always enough, but any idle time between preconditioning and the actual tests lets the controller's GC run rampant, inflating performance back towards Fresh Out Of the Box levels. Even running the preconditioning with 1 job or in a synchronous manner may leave the GC enough headroom to partially or even completely negate the preconditioning effects. And while higher performance is great, we are measuring sustained levels, i.e. the worst case, always-guaranteed numbers.
Because of all this, we needed to increase the preconditioning numjobs to 4 for sequential and 8 for random loads, set loops to 4, use ioengine=libaio with iodepth=128, and automate our test runs to keep the interval between preconditioning and the actual load minimal.
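As a rough sketch, a random preconditioning pass along these lines could be expressed as a single fio invocation like the one below (illustrative only; the actual job files are referenced from the scripts in Appendix 2).

```bash
# Hammer the whole array with asynchronous 4k random writes, 4 full passes,
# to drive the SSD controllers into their sustained (post-GC) state.
fio --name=rand-precondition --filename=/dev/xi_test \
    --ioengine=libaio --iodepth=128 --rw=randwrite --bs=4k --direct=1 \
    --numjobs=8 --loops=4 --exitall --group_reporting
```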
The scripts we used and their descriptions are available in Appendix 2.
Our goal here is to get as close to RAW drives performance as possible with as little resources as possible.
FIO patterns
RAW
Random
Preconditioning
[job 0]
filename=/dev/nvme0n1
[job 1]
...
Test
[job 0]
filename=/dev/nvme0n1
[job 1]
...
RAID arrays
Random
Preconditioning
[job1]
filename=/dev/xi_test
Test
[job1]
filename=/dev/xi_test
Appendix 2
Testing scripts
#RAW drives
~/fio/fio jobs/raw-seq-precond.job
for jobs in 16 12 8 4 1; do
  ~/fio/fio --numjobs=$jobs --offset_increment=$(bc <<< "100/$jobs")% jobs/raw-seq-write.job -o ./logs/sustained/raw-seq-write-${jobs}.log
done
for jobs in 16 12 8 4 1; do
  ~/fio/fio --numjobs=$jobs --offset_increment=$(bc <<< "100/$jobs")% jobs/raw-seq-read.job -o ./logs/sustained/raw-seq-read-${jobs}.log
done
~/fio/fio jobs/raw-rand-precond.job
for jobs in 16 12 8 4 1; do
  ~/fio/fio --numjobs=$jobs jobs/raw-rand-write.job -o ./logs/sustained/raw-rand-write-${jobs}.log
done
for jobs in 16 12 8 4 1; do
  ~/fio/fio --numjobs=$jobs jobs/raw-rand-read.job -o ./logs/sustained/raw-rand-read-${jobs}.log
done
#RAID
~/fio/fio jobs/raid-seq-precond.job
for jobs in 96 64 48 32 16 8 4 1; do
  ~/fio/fio --numjobs=$jobs --offset_increment=$(bc <<< "100/$jobs")% jobs/raid-seq-write.job -o ./logs/sustained/raid-seq-write-${jobs}.log
done
for jobs in 96 64 48 32 16 8 4 1; do
  ~/fio/fio --numjobs=$jobs --offset_increment=$(bc <<< "100/$jobs")% jobs/raid-seq-read.job -o ./logs/sustained/raid-seq-read-${jobs}.log
done
~/fio/fio jobs/raid-rand-precond.job
for jobs in 96 64 48 32 16 8 4 1; do
  ~/fio/fio --numjobs=$jobs jobs/raid-rand-write.job -o ./logs/sustained/raid-rand-write-${jobs}.log
done
for jobs in 96 64 48 32 16 8 4 1; do
  ~/fio/fio --numjobs=$jobs jobs/raid-rand-read.job -o ./logs/sustained/raid-rand-read-${jobs}.log
done
Let’s go through them:
- First we define device type, RAW or RAID. This is just to keep difference between the scripts minimal and defined mostly in one place.
- Then we run load preconditioning.
- Next we decide on the number of jobs to run based on the device type. If we are testing RAW drives, the number applies to each drive, so a test with jobs=1 on 8 drives will produce 8 jobs; for a RAID device the value applies to the whole array, so jobs=32 produces 32 jobs total. We run the tests in a loop, going from more jobs to fewer.
- After that we go through the loads. Write goes first, because it is the one most affected by preconditioning and GC; Read is far less sensitive.
- offset_increment is calculated using GNU bc by dividing 100 (percent of the block device) by the number of jobs, so that multiple jobs never write to the same sectors simultaneously. bc rounds the result down to the nearest integer by default, which suits our needs (a quick demonstration follows).
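For example, with 12 jobs per drive the offset step works out to 8% (bash, showing bc's integer truncation):

```bash
jobs=12
bc <<< "100/$jobs"   # prints 8 -> each of the 12 jobs starts 8% further into the device
```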